
I have a ~1GB XML file that has XML tags that I need to fetch data from. I have the XML file in the following format (I'm only pasting sample data because the actual file is about a gigabyte in size).

report.xml

<report>
  <report-name name="ALL_TIME_KEYWORDS_PERFORMANCE_REPORT"/>
  <date-range date="All Time"/>
  <table>
  <row campaignID="79057390" adGroupID="3451305670" keywordID="3000000" keyword="Content" avgPosition="1.55" cost="252910000" clicks="11" conv1PerClick="0" impressions="7395" day="2012-04-23" currency="INR" account="Virtual Voyage" timeZone="(GMT+05:30) India Standard Time" viewThroughConv="0"/>

  <row campaignID="79057390" adGroupID="3451305670" keywordID="3000000" keyword="Content" avgPosition="1.16" cost="0" clicks="0" conv1PerClick="0" impressions="160" day="2012-04-23" currency="INR" account="Virtual Voyage" timeZone="(GMT+05:30) India Standard Time" viewThroughConv="0"/>

  <row campaignID="79057390" adGroupID="3451305670" keywordID="3000000" keyword="Content" avgPosition="1.56" cost="0" clicks="0" conv1PerClick="0" impressions="34" day="2012-04-23" currency="INR" account="Virtual Voyage" timeZone="(GMT+05:30) India Standard Time" viewThroughConv="0"/>

  </table>
</report>
  1. What is the best way to parse/process XML files and fetch the data from XML tags in Python?

  2. Are there any frameworks that can process XML files?

  3. The method needs to be fast; it needs to finish in less than 100 seconds.

I've been using Hadoop with Python to process XML files and it usually takes nearly 200 seconds just to process the data... So I'm looking for an alternative solution that parses the above XML tags and fetches data from the tags.

By "data from the tags" I mean the attribute values, for example:

 campaignID="79057390" adGroupID="3451305670" keywordID="3000000" keyword="Content" avgPosition="1.16" cost="0" clicks="0" ...

After processing the XML file, I will store the data and values (79057390,3451305670 ...) in a MySQL database. All I need is to be able to process XML files about 1GB in size and save the processed data to a MySQL database in less than 100 seconds.

  • I am on my way to bed, but I am sure someone will be by with more information. I generally use lxml to parse my XML in Python. Here is an article I found helpful a while back: ibm.com/developerworks/library/x-hiperfparse Commented Nov 30, 2012 at 6:10
  • So you think something else is going to be faster than a Hadoop cluster? Commented Nov 30, 2012 at 6:21
  • Yes, the intention is to find a way other than Hadoop that reads the XML files much faster and loads the data into the database. Commented Nov 30, 2012 at 6:42
  • So you are processing your file on several nodes (hosts)? I have not worked with such a cluster, but I think that in Hadoop you also have to use some library for the XML processing. What did you do to get the 200-second result? How did you arrange the map-reduce process? I have some solutions for processing big data arrays, and I used multiprocessing in Python to solve the problem in an acceptable time. Commented Nov 30, 2012 at 6:42
  • @crow16384: I use Hadoop on a single machine. I just parse the XML tags and get the data in the mapper file, then print the data in the reducer file; Hadoop creates a text file at a path you specify. Now all I want is a process other than Hadoop, and faster than it, that is suited to XML processing in Python. Commented Nov 30, 2012 at 6:46

1 Answer


I recently faced a similar problem. What solved it for me was using the iterparse function and lxml. In the end, it all comes down to using a SAX-like parser instead of a DOM-like one: DOM builds the whole document in memory, while SAX is event-driven, so you will save a ton of memory using SAX (and that means time too, because you do not need to wait for the whole document to load before you start parsing it).

I think you can use something like this:

import xml.etree.ElementTree as ET  # on Python 2, xml.etree.cElementTree is the fast variant

file_path = "/path/to/your/test.xml"
# iterparse already returns an iterator of (event, element) pairs
context = ET.iterparse(file_path, events=("start", "end"))  # probably "start" alone would be enough

for event, elem in context:
    # Attributes are already available when the "start" event fires
    if event == "start" and elem.tag == "row":
        attribs = elem.attrib
        print("This is the campaignID %s and this is the adGroupID %s"
              % (attribs["campaignID"], attribs["adGroupID"]))
    elif event == "end":
        elem.clear()  # Save memory!
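
Since the goal in the question is to get the attribute values into MySQL, the same streaming idea can be pointed at a CSV file instead of print. What follows is only a minimal sketch under a few assumptions: it targets Python 3, the names iter_rows, rows_to_csv and FIELDS are hypothetical helpers made up for illustration, and the column list is an arbitrary subset of the attributes in the sample. It also detaches each finished element from its parent, because elem.clear() alone leaves the emptied elements attached to the tree.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical column selection; the real report has more attributes.
FIELDS = ["campaignID", "adGroupID", "keywordID", "keyword", "clicks", "impressions"]


def iter_rows(source, tag="row"):
    """Yield the attribute dict of every <row/> element, one at a time."""
    open_elements = []  # stack of elements whose end tag has not been seen yet
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            open_elements.append(elem)
            continue
        open_elements.pop()  # elem has just been closed
        if elem.tag == tag:
            yield dict(elem.attrib)  # copy the attributes before detaching
        if open_elements:
            # Detach the finished element so the tree does not keep growing
            open_elements[-1].remove(elem)


def rows_to_csv(source, out, fields=FIELDS):
    """Stream rows from source (a path or binary file object) into the text file out."""
    writer = csv.writer(out)
    count = 0
    for attribs in iter_rows(source):
        # Missing attributes become empty CSV cells
        writer.writerow([attribs.get(name, "") for name in fields])
        count += 1
    return count
```

A possible call would be rows_to_csv("/path/to/your/test.xml", out) with out opened via open("rows.csv", "w", newline=""); whether this meets the 100-second budget on a 1GB file is something only a measurement on the real data can tell.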

8 Comments

Actually, as you can see in the XML above, the tag only appears in the self-closing style <row/>, so how do I fetch the data from these tags with the method described above?
@Kouripm Sorry, I do not understand the question. The code I provided attempts to get a couple of attributes (as an example) and only takes attributes from the row tag. As the row tag has no content of its own apart from the attributes, which data do you want to get?
Sorry, you are right: I only want to get the attributes from the tag, and yes, the code you provided gets attributes from tags, which is fine. I got confused because I expected the tag format to be <row></row>, but here I only have <row/>. Anyway, I will try it now and let you know if I get any further errors (the XML file is very large, 1GB in size).
@Kouripm I parse a 1GB file every week with this technique and it goes well. The XML I parse is actually far more complex than yours. You will need to dump the contents into a CSV-like file so that you can load it directly into MySQL in one command.
OK, I have processed the 1GB XML file with the method above and saved it to CSV as you suggested, but it takes a long time (the system hangs). After that I used the LOAD DATA INFILE command in MySQL to store the data from the CSV in the database as below