XML parser: parsing a large file into a particular output format

Question

I am trying to parse a large XML file and print the tags to an output file. I am using minidom. My code works fine for 30 MB files, but for larger ones it raises a memory error. So I switched to buffered reading of the file, but now I am unable to get the desired output.

XML File

> <File> <TV>Sony</TV> <FOOD>Burger</FOOD> <PHONE>Apple</PHONE> </File>   
> <File> <TV>Samsung</TV> <FOOD>Pizza</FOOD> <PHONE>HTC</PHONE> </File>  
> <File> <TV>Bravia</TV> <FOOD>Pasta</FOOD> <PHONE>BlackBerry</PHONE> </File>

Desired Output

Sony, Burger, Apple
Samsung, Pizza, HTC
Bravia, Pasta, BlackBerry

When reading with a buffer, the output I get is:
Sony, Burger, Apple
Samsung,Piz Bravia, Pasta, BlackBerry

while 1:
    content = File.read(2048)
    if not len(content):
        break
    else:
        for lines in StringIO(content):
            lines = lines.lstrip(' ')
            if lines.startswith("<TV>"):
                TV = lines.strip("<TV>")
                tvVal = TV.split("</TV>")[0]
                #print tvVal
                w2.writelines(str(tvVal)+",")
            elif lines.startswith("<FOOD>"):
                FOOD = lines.strip("<FOOD>")
                foodVal = FOOD.split("</FOOD>")[0]
                #print foodVal
                w2.writelines(str(foodVal)+",")
            ............................
            ...........................

I tried with seek() but still I was unable to get the desired output.

Kien Truong · Accepted Answer · 2013-03-29 09:26:54Z

1

You're reading 2048 bytes at a time, which can leave the read cursor in the middle of a line. On the next read, the rest of that line is discarded because it doesn't start with a tag.
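To make the chunk-boundary problem concrete, here is a minimal Python 3 sketch (the snippets in this thread are Python 2). The two-line sample data and the deliberately tiny chunk_size are assumptions chosen so the boundary lands mid-line; the question used 2048.

```python
import io

# Two tag-per-line records, in the style the question's code expects (sample data).
data = "<TV>Samsung</TV>\n<FOOD>Pizza</FOOD>\n"
source = io.StringIO(data)

chunk_size = 24  # small on purpose so the boundary falls inside the second line
chunks = []
while True:
    content = source.read(chunk_size)
    if not content:
        break
    chunks.append(content)

for chunk in chunks:
    # The first chunk ends with a truncated "<FOOD>P"; the second chunk
    # starts with the leftover text, which does not begin with a tag.
    print(repr(chunk))
```

The first chunk ends partway through the FOOD line, so a startswith("<FOOD>") check only sees a truncated value, and the remainder in the next chunk matches no tag at all. That is consistent with the "Samsung,Piz" output in the question.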

Instead of rolling your own parser, consider using iterparse. An even faster version of iterparse is included with lxml. Here's an example:

import cStringIO
from xml.etree.ElementTree import iterparse

fakefile = cStringIO.StringIO("""<temp>
  <email id="1" Body="abc"/>
  <email id="2" Body="fre"/>
  <email id="998349883487454359203" Body="hi"/>
</temp>
""")
for _, elem in iterparse(fakefile):
    if elem.tag == 'email':
        print elem.attrib['id'], elem.attrib['Body']
    elem.clear()



4 Comments

vivs Over a year ago

As I already said, I used the minidom parser and it worked superbly; the only problem is with large files and reading them through a buffer. I should be checking for the <File> </File> tags: if the buffer ends partway through a record, say at <FOOD> Pasta, I should roll back and search for </File>. I am trying seek() and tell(), but I am confused and ended up here for help.

Kien Truong Over a year ago

iterparse is designed to work with huge files: you parse the file iteratively and discard information you no longer need. Unlike minidom, you don't have to keep the entire file in memory.
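A rough Python 3 sketch of that "parse, then discard" idea, using only the standard library. The generated records and the <Files> wrapper element are assumptions made for illustration (the question does not show a root element), and an in-memory BytesIO stands in for a large file on disk.

```python
import io
from xml.etree.ElementTree import iterparse

# Stand-in for a large file on disk; <Files> is an assumed wrapper element.
records = "".join(
    "<File><TV>tv%d</TV><FOOD>food%d</FOOD><PHONE>phone%d</PHONE></File>" % (i, i, i)
    for i in range(1000)
)
source = io.BytesIO(("<Files>" + records + "</Files>").encode("utf-8"))

count = 0
root = None
for event, elem in iterparse(source, events=("start", "end")):
    if root is None:
        root = elem      # the first "start" event is the root element
    if event == "end" and elem.tag == "File":
        count += 1       # handle the finished record here
        root.clear()     # detach records from the root so they can be freed

print(count)
```

Clearing the root, rather than only the current element, also drops the emptied <File> shells that would otherwise pile up under it. Whether that matters depends on how many records the file holds.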

vivs Over a year ago

import cStringIO
from xml.etree.ElementTree import iterparse
File = open("File.xml","rb")
FRead = File.read()
x = cStringIO.StringIO(Fread)
for event, elem in iteparse(x):
    if elem.tag == 'File':
        print elem.attrib['TV']
    elem.clear()

This isn't working :(

Kien Truong Over a year ago

In your example, TV is a child element, not an attribute. And don't read the whole file into a StringIO; that defeats the purpose of iterparse. Just pass the file to iterparse directly.
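Here is one way those two points might look in Python 3, as a sketch rather than a drop-in fix. It assumes the <File> records sit inside a single root element (called <Files> here, which is not in the question), and it writes the sample to a temporary file only so the example is self-contained. Child values are read with findtext, and the open file object is handed straight to iterparse.

```python
import os
import tempfile
from xml.etree.ElementTree import iterparse

xml_text = """<Files>
<File> <TV>Sony</TV> <FOOD>Burger</FOOD> <PHONE>Apple</PHONE> </File>
<File> <TV>Samsung</TV> <FOOD>Pizza</FOOD> <PHONE>HTC</PHONE> </File>
<File> <TV>Bravia</TV> <FOOD>Pasta</FOOD> <PHONE>BlackBerry</PHONE> </File>
</Files>
"""

# Write the sample to a temporary file so the example can run on its own.
with tempfile.NamedTemporaryFile("w", suffix=".xml", delete=False) as tmp:
    tmp.write(xml_text)
    path = tmp.name

fields = ("TV", "FOOD", "PHONE")
rows = []
with open(path, "rb") as xml_file:
    for _, elem in iterparse(xml_file):  # default: "end" events only
        if elem.tag == "File":
            # TV, FOOD and PHONE are child elements; elem.attrib is empty here.
            values = [elem.findtext(name, default="") for name in fields]
            rows.append(", ".join(values))
            elem.clear()  # clear only after the whole record has been read
os.remove(path)

print("\n".join(rows))
```

With the sample data this prints the three lines from the "Desired Output" section. Note that elem.clear() is called only on finished <File> elements: clearing every element as it ends would wipe the text of <TV> and its siblings before the enclosing record is processed.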

vivs · Accepted Answer · 2013-04-08 15:51:09Z

1

Thanks for your support. I have finally written my code and it's working great. Here it is:

from lxml import etree

# the_xml_file is the path (or an open file object) of the XML file to parse
for event, element in etree.iterparse(the_xml_file):
    if 'TV' in element.tag:
        print element.text

