I am trying to parse a large XML file and print the tag values to an output file. I am using minidom; my code works fine for 30 MB files, but for larger ones it raises a memory error. So I switched to buffered reading of the file, but now I am unable to get the desired output.

XML File

> <File> <TV>Sony</TV> <FOOD>Burger</FOOD> <PHONE>Apple</PHONE> </File>   
> <File> <TV>Samsung</TV> <FOOD>Pizza</FOOD> <PHONE>HTC</PHONE> </File>  
> <File> <TV>Bravia</TV> <FOOD>Pasta</FOOD> <PHONE>BlackBerry</PHONE> </File>  

Desired Output

Sony, Burger, Apple
Samsung, Pizza, HTC
Bravia, Pasta, BlackBerry

When reading with a buffer, I get this output instead:
Sony, Burger, Apple
Samsung,Piz Bravia, Pasta, BlackBerry

while 1:
    content = File.read(2048)
    if not len(content):
        break
    else:
        for lines in StringIO(content):
            lines = lines.lstrip(' ')
            if lines.startswith("<TV>"):
                TV = lines.strip("<TV>")
                tvVal = TV.split("</TV>")[0]
                #print tvVal
                w2.writelines(str(tvVal)+",")
            elif lines.startswith("<FOOD>"):
                FOOD = lines.strip("<FOOD>")
                foodVal = FOOD.split("</FOOD>")[0]
                #print foodVal
                w2.writelines(str(foodVal)+",")
                ............................
                ...........................

I tried using seek(), but I was still unable to get the desired output.

2 Answers

You're reading 2048 bytes at a time, which can leave the read cursor in the middle of a line. On the next read, the rest of that line is discarded because it doesn't start with a tag.
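To make the failure mode concrete, here is a minimal sketch (written for Python 3) of what a chunked reader would have to do to avoid losing the tail of a split line: keep the incomplete last line and prepend it to the next chunk. The helper name `iter_lines_in_chunks` and the tiny chunk size are made up for illustration, and this is not the approach recommended in this answer.

```python
import io

def iter_lines_in_chunks(stream, chunk_size=2048):
    """Yield complete lines from a text stream that is read in fixed-size chunks."""
    leftover = ""  # incomplete line carried over from the previous chunk
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        lines = (leftover + chunk).split("\n")
        leftover = lines.pop()  # last piece may be cut off mid-line
        for line in lines:
            yield line
    if leftover:
        yield leftover  # final line without a trailing newline

data = "<TV>Sony</TV>\n<FOOD>Burger</FOOD>\n<PHONE>Apple</PHONE>\n"
# A deliberately small chunk size forces reads to end mid-line.
lines = list(iter_lines_in_chunks(io.StringIO(data), chunk_size=10))
for line in lines:
    print(line)
```

Even with this fix, the code would still be a hand-rolled line parser that depends on each tag sitting on its own line.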

Instead of rolling your own parser, consider using iterparse. An even faster version of iterparse is included with lxml. Here's an example:

import cStringIO
from xml.etree.ElementTree import iterparse

fakefile = cStringIO.StringIO("""<temp>
  <email id="1" Body="abc"/>
  <email id="2" Body="fre"/>
  <email id="998349883487454359203" Body="hi"/>
</temp>
""")
for _, elem in iterparse(fakefile):
    if elem.tag == 'email':
        print elem.attrib['id'], elem.attrib['Body']
    elem.clear()

4 Comments

As I already said, I used the minidom parser and it worked superbly; the only problem I am having is with large files and reading them through a buffer. I should be checking for the <File> </File> tags: if the buffer ends partway through a record, say at <FOOD> Pasta, I should roll back and search for </File>. I tried proceeding with seek() and tell(), but I got confused and ended up here for help.
iterparse is designed to work with huge files because it lets you parse the file iteratively and discard information you no longer need. Unlike minidom, you don't have to keep the entire file in memory.
This isn't working :(
    import cStringIO
    from xml.etree.ElementTree import iterparse
    File = open("File.xml","rb")
    FRead = File.read()
    x = cStringIO.StringIO(Fread)
    for event, elem in iteparse(x):
        if elem.tag == 'File':
            print elem.attrib['TV']
        elem.clear()
In your example, TV is a child element, not an attribute. And don't read the whole file into a StringIO; that defeats the purpose of iterparse. Just pass the file to iterparse directly.
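A rough sketch of what that advice could look like for the data in the question, written for Python 3. It assumes the <File> records are wrapped in a single root element (called <Files> here purely for illustration, since well-formed XML needs one root), and `extract_rows` is a made-up helper name. An in-memory `io.BytesIO` stands in for the real file; with an actual file you would pass the open file object or its path to iterparse instead.

```python
import io
from xml.etree.ElementTree import iterparse

# Sample records from the question, wrapped in an assumed <Files> root element.
SAMPLE = b"""<Files>
<File> <TV>Sony</TV> <FOOD>Burger</FOOD> <PHONE>Apple</PHONE> </File>
<File> <TV>Samsung</TV> <FOOD>Pizza</FOOD> <PHONE>HTC</PHONE> </File>
<File> <TV>Bravia</TV> <FOOD>Pasta</FOOD> <PHONE>BlackBerry</PHONE> </File>
</Files>"""

def extract_rows(source):
    """Return one 'TV, FOOD, PHONE' string per <File> record in source."""
    rows = []
    # By default iterparse reports an element once its closing tag is seen.
    for _, elem in iterparse(source):
        if elem.tag == "File":
            # TV, FOOD and PHONE are child elements, so read their text.
            values = [elem.findtext(tag, default="") for tag in ("TV", "FOOD", "PHONE")]
            rows.append(", ".join(values))
            # Clear only finished records; clearing every element would wipe
            # the children's text before the enclosing <File> is reached.
            elem.clear()
    return rows

rows = extract_rows(io.BytesIO(SAMPLE))
for row in rows:
    print(row)
```

Clearing each record after it is processed is what keeps memory use low on large inputs.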
Thanks for your support. I have finally written my code and it is working great. Here it is:

from lxml import etree

# the_xml_file is the XML input (a file path or an open file object)
for event, element in etree.iterparse(the_xml_file):
    if 'TV' in element.tag:
        print element.text
