1

I have a CSV file that contains XML entries. Imagine that each XML entry starts with <entry> and ends with </entry>. There are thousands of these entries in my file. Each XML entry consists of nested XML elements.

I need to extract some elements of each entry and save them into another file using Python. Here is a sample of one XML entry. Imagine that I want to extract and elements of each entry. Could you please advise me how I can do this in Python? I'm a beginner in Python programming.

"<entry xmlns=""http://www.w3.org/2005/Atom"" xmlns:gnip=""http://www.gnip.com/schemas/2010"">
  <id>tag:search.twitter.com,2005:157796632933576704</id>
  <published>2012-01-13T12:10:23+00:00</published>
  <updated>2012-01-13T12:10:23+00:00</updated>
  <summary type=""html"">RT @sprice54: If you rearrange the words ""Debit card"" you can spell ""Bad Credit""</summary>
  <link rel=""alternate"" type=""text/html"" href=""http://twitter.com/GCordivari/statuses/157796632933576704""/>
  <source>
    <link rel=""self"" type=""application/json"" href=""https://stream.twitter.com/1/statuses/filter.json""/>
    <title>Twitter - Stream - Track</title>
    <updated>2012-01-13T12:10:25Z</updated>
  </source>
  <service:provider xmlns:service=""http://activitystrea.ms/service-provider"">
    <name>Twitter</name>
    <uri>http://www.twitter.com/</uri>
    <icon/>
  </service:provider>
  <contributor>
    <name>Steve Price</name>
    <uri>http://www.twitter.com/sprice54</uri>
  </contributor>
  <link rel=""via"" type=""text/html"" href=""http://twitter.com/sprice54/statuses/157748462321012736""/>
  <title>George Cordivari shared: Steve Price posted a note on Twitter</title>
  <category term=""StatusShared"" label=""Status Shared""/>
  <category term=""NoteShared"" label=""Note Shared""/>
  <activity:verb xmlns:activity=""http://activitystrea.ms/spec/1.0/"">http://activitystrea.ms/schema/1.0/share</activity:verb>
  <activity:object xmlns:activity=""http://activitystrea.ms/spec/1.0/"">
    <activity:object-type>http://activitystrea.ms/schema/1.0/note</activity:object-type>
    <id>object:search.twitter.com,2005:157796632933576704</id>
    <content type=""html"">RT @sprice54: If you rearrange the words ""Debit card"" you can spell ""Bad Credit""</content>
    <link rel=""alternate"" type=""text/html"" href=""http://twitter.com/GCordivari/statuses/157796632933576704""/>
  </activity:object>
  <author>
    <name>George Cordivari</name>
    <uri>http://www.twitter.com/GCordivari</uri>
  </author>
  <activity:author xmlns:activity=""http://activitystrea.ms/spec/1.0/"">
    <activity:object-type>http://activitystrea.ms/schema/1.0/person</activity:object-type>
    <gnip:friends xmlns:gnip=""http://www.gnip.com/schemas/2010"" followersCount=""37"" followingCount=""61""/>
    <link rel=""alternate"" type=""text/html"" length=""0"" href=""http://www.twitter.com/GCordivari""/>
    <link rel=""avatar"" href=""http://a0.twimg.com/profile_images/1670548060/274805_1268643462_1179159089_n_normal.jpg""/>
    <id>http://www.twitter.com/GCordivari</id>
  </activity:author>
  <activity:actor xmlns:activity=""http://activitystrea.ms/spec/1.0/"">
    <activity:object-type>http://activitystrea.ms/schema/1.0/person</activity:object-type>
    <gnip:friends xmlns:gnip=""http://www.gnip.com/schemas/2010"" followersCount=""37"" followingCount=""61""/>
    <gnip:stats xmlns:gnip=""http://www.gnip.com/schemas/2010"" activityCount=""370"" upstreamId=""id:twitter.com:427031045""/>
    <link rel=""alternate"" type=""text/html"" length=""0"" href=""http://www.twitter.com/GCordivari""/>
    <link rel=""avatar"" href=""http://a0.twimg.com/profile_images/1670548060/274805_1268643462_1179159089_n_normal.jpg""/>
    <id>http://www.twitter.com/GCordivari</id>
    <os:location xmlns:os=""http://ns.opensocial.org/2008/opensocial"">Drexel Hell</os:location>
    <os:aboutMe xmlns:os=""http://ns.opensocial.org/2008/opensocial"">This is the way I live. #CirocInMyCupIDGAF #CloudNine  #FollowMeLikeTheLeader </os:aboutMe>
  </activity:actor>
  <gnip:twitter_entities xmlns:gnip=""http://www.gnip.com/schemas/2010"">
    <user_mentions>
      <user_mention start=""3"" end=""12"">
        <id>255347428</id>
        <name>Steve Price</name>
        <screen_name>sprice54</screen_name>
      </user_mention>
    </user_mentions>
  </gnip:twitter_entities>
  <gnip:matching_rules>
    <gnip:matching_rule rel=""inferred"">""debit card""</gnip:matching_rule>
  </gnip:matching_rules>
</entry>"

3 Answers

1

Use the csv module to parse the CSV file and something like ElementTree to parse the XML fields.

If your XML data is RSS-compatible, look at feedparser.
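As a rough illustration of that suggestion, here is a minimal sketch that combines the csv module with xml.etree.ElementTree. It assumes Python 3, that every XML entry sits in its own quoted CSV field as in the sample, and that the elements wanted are published and title; those two tag names and the helper name extract_rows are hypothetical, since the question does not say which elements are needed.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Namespace of the sample's root element; its unprefixed children inherit it.
ATOM = {"atom": "http://www.w3.org/2005/Atom"}


def extract_rows(csv_file, tags=("published", "title")):
    """Yield one list of element texts per <entry> field in csv_file.

    `tags` are hypothetical choices; swap in the elements you need.
    """
    # csv.reader keeps quoted multi-line fields together and turns "" into ".
    for row in csv.reader(csv_file):
        for field in row:
            if not field.lstrip().startswith("<entry"):
                continue  # not an XML entry, e.g. a header or another column
            entry = ET.fromstring(field)
            # Look up direct children of <entry> only; missing tags give "".
            yield [
                entry.findtext("atom:" + tag, default="", namespaces=ATOM)
                for tag in tags
            ]


# A trimmed stand-in for the real file; with a file on disk, pass
# open(path, newline="", encoding="utf-8") instead (encoding is an assumption).
sample = (
    '"<entry xmlns=""http://www.w3.org/2005/Atom"">\n'
    '  <published>2012-01-13T12:10:23+00:00</published>\n'
    '  <title>George Cordivari shared: Steve Price posted a note on Twitter</title>\n'
    '</entry>"\n'
)

rows = list(extract_rows(io.StringIO(sample)))

# Write the extracted values as plain CSV (into memory here, a file in practice).
output = io.StringIO()
csv.writer(output).writerows(rows)
print(output.getvalue())
```

The namespace mapping matters: because the entry declares a default namespace, a bare lookup such as entry.findtext("title") would find nothing.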


1

Python has a number of really great XML parsing utilities. BeautifulSoup is very popular because it has a simple API. http://www.crummy.com/software/BeautifulSoup/doc/

lxml is a great library for very fast XML parsing, but it requires libxml2.

There are plenty of tutorials online that explain, step by step, the basics of parsing XML with Python. http://www.learningpython.com/2008/05/07/elegant-xml-parsing-using-the-elementtree-module/
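To give a flavour of what such a tutorial covers, here is a small sketch using the standard library's xml.etree.ElementTree on a trimmed copy of the sample entry. The prefixes in NS are arbitrary labels chosen for this example, and the two values pulled out (the author name and the follower count) are only illustrative picks.

```python
import xml.etree.ElementTree as ET

# Prefix -> namespace URI; the URIs come from the sample entry,
# the prefixes on the left are just local shorthand.
NS = {
    "atom": "http://www.w3.org/2005/Atom",
    "activity": "http://activitystrea.ms/spec/1.0/",
    "gnip": "http://www.gnip.com/schemas/2010",
}

# Trimmed from the sample entry, with the CSV quoting already undone.
entry_xml = """<entry xmlns="http://www.w3.org/2005/Atom" xmlns:gnip="http://www.gnip.com/schemas/2010">
  <author>
    <name>George Cordivari</name>
  </author>
  <activity:actor xmlns:activity="http://activitystrea.ms/spec/1.0/">
    <gnip:friends followersCount="37" followingCount="61"/>
  </activity:actor>
</entry>"""

entry = ET.fromstring(entry_xml)

# Text of a nested element: entry -> author -> name.
author_name = entry.findtext("atom:author/atom:name", namespaces=NS)

# Attribute of a nested, namespaced element.
friends = entry.find("activity:actor/gnip:friends", NS)
followers = int(friends.get("followersCount"))

print(author_name, followers)
```

Nested elements are reached with a slash-separated path, and every step needs its namespace prefix, including the unprefixed tags that inherit the Atom default namespace.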


0

Following the examples in the docs, here is how you could extract all elements with a given name, say contributor, and export them to a new XML document.

import csv
import xml.dom.minidom as minidom

#paths to the input csv/xml file and the output file
inputPath = '/path/to/xml.csv'
outputPath = '/path/to/contributors.xml'

#create a new xml document and top level element
impl = minidom.getDOMImplementation()
newxml = impl.createDocument(None,'contributors',None)
top = newxml.documentElement

#read the file with the csv module (Python 3) instead of splitting on commas:
#csv.reader keeps quoted fields that span several lines or contain commas
#intact, and turns the doubled double quotes ("") back into single ones
#(utf-8 is an assumption about the file's encoding)
with open(inputPath, newline='', encoding='utf-8') as xml_csv:
    for row in csv.reader(xml_csv):
        for fldxml in row:
            #skip fields that do not hold XML
            if not fldxml.lstrip().startswith('<'):
                continue

            #parse the xml data from each field and
            #find all contributor elements in the entry
            dom = minidom.parseString(fldxml)
            contributors = dom.getElementsByTagName('contributor')

            #copy each contributor into the new xml document
            for contributor in contributors:
                top.appendChild(newxml.importNode(contributor, True))

#write out the new xml contributors document in pretty XML
with open(outputPath, 'w', encoding='utf-8') as outxml:
    outxml.write(newxml.toprettyxml())

3 Comments

Thanks, Tharen. I get this error when I run your code: "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/dom/minidom.py", line 1924, in parseString return expatbuilder.parseString(string) File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/dom/expatbuilder.py", line 940, in parseString return builder.parseString(string) File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/dom/expatbuilder.py", line 223, in parseString parser.Parse(string, True) xml.parsers.expat.ExpatError: syntax error
It is saying the parser failed to parse the string it was provided. I'd suspect that the XML is malformed, or is otherwise not understood by the built-in parser. You may have better luck with other parsers. If you have no control over the XML, I'd suggest trying something else.
You described your data as "a csv file that contains XML entries", which I took to mean '[xmldata],[xmldata],...', where xmldata included <entry>...</entry>. If this is incorrect, you will need to provide more context.
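One plausible cause of an ExpatError like the one reported, given the sample in the question, is that each entry spans many lines and contains commas of its own, so reading line by line and splitting on commas hands the parser fragments rather than whole entries. The sketch below is only a guess at the commenter's situation, not a confirmed diagnosis; it uses a two-element stand-in for the real data and a hypothetical helper named parses to compare naive splitting with the csv module.

```python
import csv
import io
import xml.dom.minidom as minidom
from xml.parsers.expat import ExpatError

# Stand-in for the real file: one quoted CSV field holding a multi-line entry
# whose <id> text contains a comma, as in the question's sample.
csv_text = (
    '"<entry xmlns=""http://www.w3.org/2005/Atom"">\n'
    '  <id>tag:search.twitter.com,2005:157796632933576704</id>\n'
    '</entry>"\n'
)


def parses(text):
    """Return True if minidom accepts text as a complete XML document."""
    try:
        minidom.parseString(text)
        return True
    except ExpatError:
        return False


# Naive approach: read line by line, split on commas, undo the doubled quotes.
naive_fields = [
    field.replace('""', '"')
    for line in csv_text.splitlines()
    for field in line.split(',')
]
naive_ok = [parses(field) for field in naive_fields]

# csv module: quoted fields may span lines and contain commas.
csv_fields = [field for row in csv.reader(io.StringIO(csv_text)) for field in row]
csv_ok = [parses(field) for field in csv_fields]

print(len(naive_fields), naive_ok)
print(len(csv_fields), csv_ok)
```

With this stand-in, the naive split produces several fragments and none of them parses, while csv.reader returns the entry as a single field that does.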
