How to process XML files in Python

Question

I have a ~1GB XML file that has XML tags that I need to fetch data from. I have the XML file in the following format (I'm only pasting sample data because the actual file is about a gigabyte in size).

report.xml

<report>
  <report-name name="ALL_TIME_KEYWORDS_PERFORMANCE_REPORT"/>
  <date-range date="All Time"/>
  <table>
  <row campaignID="79057390" adGroupID="3451305670" keywordID="3000000" keyword="Content" avgPosition="1.55" cost="252910000" clicks="11" conv1PerClick="0" impressions="7395" day="2012-04-23" currency="INR" account="Virtual Voyage" timeZone="(GMT+05:30) India Standard Time" viewThroughConv="0"/>

  <row campaignID="79057390" adGroupID="3451305670" keywordID="3000000" keyword="Content" avgPosition="1.16" cost="0" clicks="0" conv1PerClick="0" impressions="160" day="2012-04-23" currency="INR" account="Virtual Voyage" timeZone="(GMT+05:30) India Standard Time" viewThroughConv="0"/>

  <row campaignID="79057390" adGroupID="3451305670" keywordID="3000000" keyword="Content" avgPosition="1.56" cost="0" clicks="0" conv1PerClick="0" impressions="34" day="2012-04-23" currency="INR" account="Virtual Voyage" timeZone="(GMT+05:30) India Standard Time" viewThroughConv="0"/>

  </table>
</report>

What is the best way to parse/process XML files and fetch the data from xml tags in Python?
Are there any frameworks that can process XML files?
The method needs to be fast; it needs to finish in less than 100 seconds.

I've been using Hadoop with Python to process XML files and it usually takes nearly 200 seconds just to process the data... So I'm looking for an alternative solution that parses the above XML tags and fetches data from the tags.

By "data from the tags" I mean the attribute values, for example:

 campaignID="79057390" adGroupID="3451305670" keywordID="3000000" keyword="Content" avgPosition="1.16" cost="0" clicks="0" ...

After processing the XML file, I will store the data and values (79057390,3451305670 ...) in a MySQL database. All I need is to be able to process XML files about 1GB in size and save the processed data to a MySQL database in less than 100 seconds.

I am on my way to bed, but I am sure someone will be by with more information. I generally use lxml to parse my XML in Python. Here is an article I found helpful a while back: ibm.com/developerworks/library/x-hiperfparse
– matchew, Commented Nov 30, 2012 at 6:10
So you think something else is going to be faster than a Hadoop cluster?
– specialscope, Commented Nov 30, 2012 at 6:21
Yes, the intention is to find a way other than Hadoop that reads the XML files much faster and loads the data into the database.
– Shiva Krishna Bavandla, Commented Nov 30, 2012 at 6:42
So you are processing your file on some nodes (hosts)? I have not worked with such a cluster, but I think that in Hadoop you would also have to use some library for the XML processing. What did you do to get the 200-second result? How did you arrange the map-reduce process? I have some solutions for processing big data arrays, and I used multiprocessing in Python to solve the problem in an acceptable time.
– crow16384, Commented Nov 30, 2012 at 6:42
@crow16384: I use Hadoop on a single machine. I just parse the XML tags and get the data in the mapper file and print the data in the reducer file; Hadoop then creates a text file at a path you specify. What I want now is a process other than Hadoop, and faster than it, that is specifically meant for XML processing in Python.
– Shiva Krishna Bavandla, Commented Nov 30, 2012 at 6:46

Juan Antonio Gomez Moriano · Accepted Answer · 2012-11-30 06:43:19Z

2

I recently faced a similar problem. What solved it for me was the iterparse function (lxml provides it, and so does the standard library's ElementTree, which the code below uses). In the end it comes down to using a SAX-like parser instead of a DOM-like one: a DOM parser builds the whole document in memory, while a SAX parser is event-driven, so you save a ton of memory with SAX (and time too, as you do not have to wait for the whole document to load before you start processing it).

I think you can use something like this:

import xml.etree.ElementTree as ET  # cElementTree was removed in Python 3.9

file_path = "/path/to/your/test.xml"
context = ET.iterparse(file_path, events=("start", "end")) #Probably we could use only the start tag

for event, elem in context:
    if event == 'start':
        # Attributes are already available when the start event fires
        if elem.tag == "row":
            attribs = elem.attrib
            print("This is the campaignID %s and this is the adGroupID %s" % (attribs['campaignID'], attribs['adGroupID']))
    else:
        elem.clear() #Save memory! Clear each element once it has been fully parsed
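
For comparison, the same event-driven idea can be sketched with the standard library's xml.sax module, where the parser calls a handler for each opening tag and builds no tree at all. This is only an illustration, not what the answer above uses: the RowHandler class and the inline sample are made up for the example, and a real run over a 1GB file would write each row out instead of collecting them in a list.

```python
import io
import xml.sax


class RowHandler(xml.sax.ContentHandler):
    """Hypothetical handler: collects (campaignID, adGroupID) from <row> tags."""

    def __init__(self):
        super().__init__()
        # Kept in a list only to make the example checkable; for a large
        # file, write each row out here instead of accumulating it.
        self.rows = []

    def startElement(self, name, attrs):
        # Called once per opening tag, with that tag's attributes
        if name == "row":
            self.rows.append((attrs["campaignID"], attrs["adGroupID"]))


# Small inline sample shaped like the report in the question
sample = b"""<report>
  <table>
    <row campaignID="79057390" adGroupID="3451305670" clicks="11"/>
    <row campaignID="79057390" adGroupID="3451305670" clicks="0"/>
  </table>
</report>"""

handler = RowHandler()
xml.sax.parse(io.BytesIO(sample), handler)
print(handler.rows)
```

xml.sax.parse also accepts a file path in place of the BytesIO object, so the same handler could be pointed at the real report.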

edited Nov 30, 2012 at 6:43

answered Nov 30, 2012 at 6:35

Juan Antonio Gomez Moriano


8 Comments

Shiva Krishna Bavandla Over a year ago

Actually, as you can see in the XML above, the tags only come in the self-closing style, like <row/>. How do I fetch the data from these tags with the method above?

Juan Antonio Gomez Moriano Over a year ago

@Kouripm Sorry, I do not understand the question. The code I provided attempts to get a couple of attributes (as an example) and only takes attributes from the row tag. As the row tag has no content of its own apart from the attributes, which data do you want to get?

Shiva Krishna Bavandla Over a year ago

Sorry, you are right: I only want the attributes from the tag, and the code you provided does get them, so that is fine. I got confused because I expected the tag format to be <row></row>, but here I only have <row/>. Anyway, I will try it now and let you know if I hit any errors (the XML file is very large, 1GB).

Juan Antonio Gomez Moriano Over a year ago

@Kouripm I parse a 1GB file every week with this technique and it goes well; the XML I parse is actually far more complex than yours. You will need to dump the contents into a CSV-like file so that you can load it directly into MySQL in one command.
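
A minimal sketch of that CSV dump, assuming the row layout from the question. The rows_to_csv function, the FIELDS column selection and the inline sample are hypothetical names chosen for the example; the real column list would have to match the target table.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical column selection; use whichever attributes the table needs
FIELDS = ["campaignID", "adGroupID", "keywordID", "clicks", "impressions", "day"]


def rows_to_csv(xml_source, csv_file, fields=FIELDS):
    """Stream <row> attributes from xml_source into csv_file; return the row count."""
    writer = csv.writer(csv_file)
    count = 0
    for event, elem in ET.iterparse(xml_source, events=("end",)):
        if elem.tag == "row":
            # Missing attributes become empty cells instead of raising KeyError
            writer.writerow([elem.get(name, "") for name in fields])
            count += 1
        elem.clear()  # release the element's data once it has been handled
    return count


# Small inline sample shaped like the report in the question
sample = io.BytesIO(b"""<report><table>
<row campaignID="79057390" adGroupID="3451305670" keywordID="3000000" clicks="11" impressions="7395" day="2012-04-23"/>
<row campaignID="79057390" adGroupID="3451305670" keywordID="3000000" clicks="0" impressions="160" day="2012-04-23"/>
</table></report>""")
out = io.StringIO()
written = rows_to_csv(sample, out)
print(written, "rows written")
```

For the real file, pass the path to the XML report and a file opened with open(path, "w", newline=""); the resulting CSV can then be loaded with MySQL's LOAD DATA INFILE statement.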

Shiva Krishna Bavandla Over a year ago

OK, I have processed the 1GB XML file with the above method and saved it to CSV as you suggested, but it takes a long time (the system hangs). After that I used the LOAD DATA INFILE command in MySQL to store the data from the CSV in the database, as below
