Parsing a text/CSV file containing XML entries in Python

Question

I have a CSV file that contains XML entries. Each XML entry starts with <entry> and ends with </entry>, and there are thousands of these entries in my file. Each entry consists of nested XML elements.

I need to extract some elements of each entry and save them into another file using Python. Here is a sample of one XML entry. Imagine that I want to extract and elements of each entry. Could you please advise me on how I can do this in Python? I'm a beginner in Python programming.

"<entry xmlns=""http://www.w3.org/2005/Atom"" xmlns:gnip=""http://www.gnip.com/schemas/2010"">
  <id>tag:search.twitter.com,2005:157796632933576704</id>
  <published>2012-01-13T12:10:23+00:00</published>
  <updated>2012-01-13T12:10:23+00:00</updated>
  <summary type=""html"">RT @sprice54: If you rearrange the words ""Debit card"" you can spell ""Bad Credit""</summary>
  <link rel=""alternate"" type=""text/html"" href=""http://twitter.com/GCordivari/statuses/157796632933576704""/>
  <source>
    <link rel=""self"" type=""application/json"" href=""https://stream.twitter.com/1/statuses/filter.json""/>
    <title>Twitter - Stream - Track</title>
    <updated>2012-01-13T12:10:25Z</updated>
  </source>
  <service:provider xmlns:service=""http://activitystrea.ms/service-provider"">
    <name>Twitter</name>
    <uri>http://www.twitter.com/</uri>
    <icon/>
  </service:provider>
  <contributor>
    <name>Steve Price</name>
    <uri>http://www.twitter.com/sprice54</uri>
  </contributor>
  <link rel=""via"" type=""text/html"" href=""http://twitter.com/sprice54/statuses/157748462321012736""/>
  <title>George Cordivari shared: Steve Price posted a note on Twitter</title>
  <category term=""StatusShared"" label=""Status Shared""/>
  <category term=""NoteShared"" label=""Note Shared""/>
  <activity:verb xmlns:activity=""http://activitystrea.ms/spec/1.0/"">http://activitystrea.ms/schema/1.0/share</activity:verb>
  <activity:object xmlns:activity=""http://activitystrea.ms/spec/1.0/"">
    <activity:object-type>http://activitystrea.ms/schema/1.0/note</activity:object-type>
    <id>object:search.twitter.com,2005:157796632933576704</id>
    <content type=""html"">RT @sprice54: If you rearrange the words ""Debit card"" you can spell ""Bad Credit""</content>
    <link rel=""alternate"" type=""text/html"" href=""http://twitter.com/GCordivari/statuses/157796632933576704""/>
  </activity:object>
  <author>
    <name>George Cordivari</name>
    <uri>http://www.twitter.com/GCordivari</uri>
  </author>
  <activity:author xmlns:activity=""http://activitystrea.ms/spec/1.0/"">
    <activity:object-type>http://activitystrea.ms/schema/1.0/person</activity:object-type>
    <gnip:friends xmlns:gnip=""http://www.gnip.com/schemas/2010"" followersCount=""37"" followingCount=""61""/>
    <link rel=""alternate"" type=""text/html"" length=""0"" href=""http://www.twitter.com/GCordivari""/>
    <link rel=""avatar"" href=""http://a0.twimg.com/profile_images/1670548060/274805_1268643462_1179159089_n_normal.jpg""/>
    <id>http://www.twitter.com/GCordivari</id>
  </activity:author>
  <activity:actor xmlns:activity=""http://activitystrea.ms/spec/1.0/"">
    <activity:object-type>http://activitystrea.ms/schema/1.0/person</activity:object-type>
    <gnip:friends xmlns:gnip=""http://www.gnip.com/schemas/2010"" followersCount=""37"" followingCount=""61""/>
    <gnip:stats xmlns:gnip=""http://www.gnip.com/schemas/2010"" activityCount=""370"" upstreamId=""id:twitter.com:427031045""/>
    <link rel=""alternate"" type=""text/html"" length=""0"" href=""http://www.twitter.com/GCordivari""/>
    <link rel=""avatar"" href=""http://a0.twimg.com/profile_images/1670548060/274805_1268643462_1179159089_n_normal.jpg""/>
    <id>http://www.twitter.com/GCordivari</id>
    <os:location xmlns:os=""http://ns.opensocial.org/2008/opensocial"">Drexel Hell</os:location>
    <os:aboutMe xmlns:os=""http://ns.opensocial.org/2008/opensocial"">This is the way I live. #CirocInMyCupIDGAF #CloudNine  #FollowMeLikeTheLeader </os:aboutMe>
  </activity:actor>
  <gnip:twitter_entities xmlns:gnip=""http://www.gnip.com/schemas/2010"">
    <user_mentions>
      <user_mention start=""3"" end=""12"">
        <id>255347428</id>
        <name>Steve Price</name>
        <screen_name>sprice54</screen_name>
      </user_mention>
    </user_mentions>
  </gnip:twitter_entities>
  <gnip:matching_rules>
    <gnip:matching_rule rel=""inferred"">""debit card""</gnip:matching_rule>
  </gnip:matching_rules>
</entry>"

Paulo Scardine · Accepted Answer · 2012-02-07 01:21:12Z

1

Use the csv module to parse the CSV and something like ElementTree to parse the XML fields.

If your XML data is RSS-compatible, look at feedparser.
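
A minimal sketch of that approach follows; it is an illustration, not the answerer's own code. It assumes Python 3, that each entry sits in one quoted CSV field with doubled quotes as in the sample, and that the wanted elements are id, published and summary. Those element names, the WANTED list and the extract_entries function are hypothetical choices made for the example.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Default namespace of the sample entry (Atom); the prefix "atom" is arbitrary.
NS = {"atom": "http://www.w3.org/2005/Atom"}

# Hypothetical choice of elements to extract; adjust to your needs.
WANTED = ["id", "published", "summary"]


def extract_entries(csv_file):
    """Yield one list of element texts per XML entry found in csv_file."""
    # csv.reader copes with quoted fields that span several lines and
    # turns doubled quotes ("") back into single quotes.
    for row in csv.reader(csv_file):
        for field in row:
            if not field.lstrip().startswith("<entry"):
                continue  # skip cells that are not XML entries
            entry = ET.fromstring(field)
            yield [entry.findtext("atom:" + tag, default="", namespaces=NS)
                   for tag in WANTED]


if __name__ == "__main__":
    # Cut-down stand-in for the real file, quoted like the sample above.
    sample = io.StringIO(
        '"<entry xmlns=""http://www.w3.org/2005/Atom"">\n'
        '  <id>tag:search.twitter.com,2005:157796632933576704</id>\n'
        '  <published>2012-01-13T12:10:23+00:00</published>\n'
        '  <summary type=""html"">you can spell ""Bad Credit""</summary>\n'
        '</entry>"\n'
    )
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(WANTED)
    writer.writerows(extract_entries(sample))
    print(out.getvalue())
```

With a real file, pass open(path, newline="") instead of the StringIO objects so the csv module sees the embedded line breaks unchanged.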

dm03514 · Accepted Answer · 2012-02-07 01:23:29Z

1

Python has a number of really great XML parsing utilities. BeautifulSoup is very popular because it has a simple API. http://www.crummy.com/software/BeautifulSoup/doc/

lxml is a great library for very fast XML parsing, but it requires libxml2.

There are plenty of tutorials online that explain, step by step, the basics of parsing XML with Python. http://www.learningpython.com/2008/05/07/elegant-xml-parsing-using-the-elementtree-module/
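
One detail that matters for the sample entry if you go the ElementTree route is namespaces: most elements carry a prefix (activity:, gnip:) or inherit the default Atom namespace, so plain tag names will not match. The sketch below assumes Python 3; the helper names actor_followers and contributor_names and the shortened ENTRY string are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Prefix-to-URI map copied from the xmlns declarations in the sample entry.
NS = {
    "atom": "http://www.w3.org/2005/Atom",
    "activity": "http://activitystrea.ms/spec/1.0/",
    "gnip": "http://www.gnip.com/schemas/2010",
}

# Shortened, hypothetical stand-in for one entry (already CSV-unquoted).
ENTRY = (
    '<entry xmlns="http://www.w3.org/2005/Atom" '
    'xmlns:gnip="http://www.gnip.com/schemas/2010">'
    '<contributor><name>Steve Price</name></contributor>'
    '<activity:actor xmlns:activity="http://activitystrea.ms/spec/1.0/">'
    '<gnip:friends followersCount="37" followingCount="61"/>'
    '</activity:actor>'
    '</entry>'
)


def actor_followers(entry_xml):
    """Return the actor's followersCount as an int, or None if absent."""
    entry = ET.fromstring(entry_xml)
    friends = entry.find("activity:actor/gnip:friends", NS)
    if friends is None:
        return None
    count = friends.get("followersCount")  # unprefixed attribute, no namespace
    return int(count) if count is not None else None


def contributor_names(entry_xml):
    """Return the text of every contributor/name element in the entry."""
    entry = ET.fromstring(entry_xml)
    # Unprefixed child elements inherit the default (Atom) namespace.
    return [n.text for n in entry.findall("atom:contributor/atom:name", NS)]


if __name__ == "__main__":
    print(actor_followers(ENTRY))
    print(contributor_names(ENTRY))
```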

tharen · Accepted Answer · 2012-02-07 02:08:11Z

0

Following the examples in the docs, here is how you could extract all elements with a given name, say contributor, and export them to a new XML document.

import xml.dom.minidom as minidom

# open the input csv/xml file
inputPath = '/path/to/xml.csv'
xml_csv = open(inputPath)

# open an output file in write mode
outputPath = '/path/to/contributors.xml'
outxml = open(outputPath, 'w')

# create a new xml document and top level element
impl = minidom.getDOMImplementation()
newxml = impl.createDocument(None, 'contributors', None)
top = newxml.documentElement

# loop through each line in the file, splitting on commas
# (this assumes every field is a complete XML document on a single line)
for line in xml_csv:
    xmlFields = line.split(',')

    for fldxml in xmlFields:
        # doubled double quotes caused the parser to choke, so replace them here
        fldxml = fldxml.replace('""', '"')

        # parse the xml data from each field and
        # find all contributor elements in it
        dom = minidom.parseString(fldxml)
        contributors = dom.getElementsByTagName('contributor')

        # import each contributor into the new xml document and append it
        for contributor in contributors:
            top.appendChild(newxml.importNode(contributor, True))

# write out the new xml contributors document in pretty XML
outxml.write(newxml.toprettyxml())
outxml.close()
xml_csv.close()

3 Comments

saghar Over a year ago

Thanks, Tharen. I get this error when I run your code:

"/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/dom/minidom.py", line 1924, in parseString
    return expatbuilder.parseString(string)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/dom/expatbuilder.py", line 940, in parseString
    return builder.parseString(string)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/dom/expatbuilder.py", line 223, in parseString
    parser.Parse(string, True)
xml.parsers.expat.ExpatError: syntax error

tharen Over a year ago

It is saying that the parser failed to parse the string it was given. I suspect that the XML is malformed or is otherwise not understood by the built-in parser. You may have better luck with other parsers. If you have no control over the XML, I'd suggest trying something else.

tharen Over a year ago

You described your data as "a csv file that contains XML entries", which I took to mean '[xmldata],[xmldata],...', where xmldata includes <entry>...</entry>. If this is incorrect, you will need to provide more context.
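
One plausible reading of the error, given the sample in the question, is that each entry spans several lines and contains commas of its own, so splitting physical lines on commas hands the parser fragments instead of whole documents. The sketch below illustrates that under stated assumptions: Python 3, a cut-down RAW string standing in for the real file, and a hypothetical helper named parses. It compares the naive split with csv.reader, which reassembles a quoted multi-line field.

```python
import csv
import io
import xml.dom.minidom as minidom
from xml.parsers.expat import ExpatError

# A cut-down stand-in for one record of the file, quoted as in the question.
RAW = (
    '"<entry xmlns=""http://www.w3.org/2005/Atom"">\n'
    '  <id>tag:search.twitter.com,2005:157796632933576704</id>\n'
    '</entry>"\n'
)


def parses(text):
    """Return True if minidom can parse text as a complete XML document."""
    try:
        minidom.parseString(text)
        return True
    except ExpatError:
        return False


# Splitting each physical line on commas produces fragments, not documents.
fragments = [part.replace('""', '"')
             for line in RAW.splitlines()
             for part in line.split(",")]

# csv.reader reassembles the quoted, multi-line field into one string
# and undoes the doubled quotes.
fields = [field for row in csv.reader(io.StringIO(RAW)) for field in row]

if __name__ == "__main__":
    print("naive split:", [parses(f) for f in fragments])
    print("csv.reader: ", [parses(f) for f in fields])
```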
