Parse XML file using Python. multiple hierarchies

Question

I want to parse the following XML file using Python. My "folder" variable always equals the 8-digit number near the end of the <link> tag. In this case it is 11199709.

Python

for folder in folderList:

I want to be able to say: when "folder" equals the last 8 digits in the link tag, give me the eq:seconds value. I tried adapting the code from the Python docs for ElementTree, but I am having trouble because there are so many levels of hierarchy. root[0][1].text will not retrieve values under the item tag.

XML

<rss xmlns:georss="http://www.georss.org/georss/" xmlns:eq="http://earthquake.usgs.gov/rss/1.0/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" version="2.0">
  <channel>
      <title>USGS Earthquake ShakeMaps</title>
      <description>List of ShakeMaps for events in the last 30 days</description>
      <link>http://earthquake.usgs.gov/</link>
      <dc:publisher>U.S. Geological Survey</dc:publisher>
      <pubDate>Thu, 27 Mar 2014 15:33:05 +0000</pubDate>
      <item>
         <title>4.11 - 79.3 miles NNW of Kotzebue</title>
         <description>
         <![CDATA[<img src="http://earthquake.usgs.gov/eqcenter/shakemap/thumbs/shakemap_ak_11199709.jpg" width="100" align="left" hspace="10"/><p>Date: Thu, 27 Mar 2014 07:28:31 UTC<br/>Lat/Lon: 67.9858/-163.494<br/>Depth: 15.9122</p>]]></description>
         <link>http://earthquake.usgs.gov/eqcenter/shakemap/ak/shake/11199709/</link>
         <pubDate>Thu, 27 Mar 2014 07:53:33 +0000</pubDate>
         <geo:lat>67.9858</geo:lat>
         <geo:long>-163.494</geo:long>
         <dc:subject>4</dc:subject>
         <eq:seconds>1395905311</eq:seconds>
         <eq:depth>15.9122</eq:depth>
         <eq:region>ak</eq:region>
         </item>
       <item>
              ...similar to above item

Midnighter · Accepted Answer · 2014-03-28 14:48:36Z

1

If you are worried about speed, I recommend lxml. It has extra dependencies but is usually much faster than BeautifulSoup.
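
As a minimal sketch of the lookup itself, the example below uses the standard library's xml.etree.ElementTree, whose find/findall/findtext interface lxml.etree largely mirrors. The sample document is a trimmed copy of the feed in the question, and the helper name seconds_for_folder and the contents of folderList are assumptions made for illustration. The key point is that eq:seconds lives in a namespace, so the prefix has to be mapped to its URI when searching.

```python
import xml.etree.ElementTree as ET

# Namespace prefix -> URI, copied from the <rss> element in the question.
NS = {"eq": "http://earthquake.usgs.gov/rss/1.0/"}

# A trimmed copy of the feed from the question, inlined so the sketch runs on its own.
SAMPLE = """<rss xmlns:eq="http://earthquake.usgs.gov/rss/1.0/" version="2.0">
  <channel>
    <title>USGS Earthquake ShakeMaps</title>
    <item>
      <title>4.11 - 79.3 miles NNW of Kotzebue</title>
      <link>http://earthquake.usgs.gov/eqcenter/shakemap/ak/shake/11199709/</link>
      <eq:seconds>1395905311</eq:seconds>
    </item>
  </channel>
</rss>"""


def seconds_for_folder(root, folder):
    """Return the eq:seconds text of the first <item> whose <link> ends with folder, else None."""
    # iter() walks the whole tree, so the nesting depth of <item> does not matter.
    for item in root.iter("item"):
        link = item.findtext("link", default="")
        # The folder ID is the last path component of the link.
        if link.rstrip("/").endswith(str(folder)):
            return item.findtext("eq:seconds", namespaces=NS)
    return None


root = ET.fromstring(SAMPLE)  # for a file: ET.parse("filename.xml").getroot()
folderList = ["11199709"]  # hypothetical contents, for illustration only
for folder in folderList:
    print(folder, seconds_for_folder(root, folder))
```

Searching by tag name rather than by position (root[0][1]) keeps the lookup independent of how many elements precede the items in the channel.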

answered Mar 28, 2014 at 14:48

Midnighter


score 0 · Accepted Answer · 2014-03-28 15:00:39Z

0

Use BeautifulSoup, which can parse HTML and XML (the latter with an external module) and is much easier to use than the parser included with Python.

This code should do what you want:

from bs4 import BeautifulSoup

# Load your XML file here. The "xml" argument selects an XML parser (requires lxml).
# You can also load the document from a URL by using "urllib" or "Python-Requests".
with open("filename.xml") as f:
    xml = BeautifulSoup(f, "xml")

# BeautifulSoup(f) without the "xml" argument falls back to an HTML parser;
# see comments below

for folder in folderList:
    for item in xml.find_all("item"):  # iterate through all <item> elements
        if str(folder) in item.link.text:  # if folder's name is in the <link> element
            print(item.find("eq:seconds").text)  # print the <eq:seconds> element

edited Mar 28, 2014 at 15:00

answered Mar 28, 2014 at 14:32

user2629998

6 Comments

Andrew Over a year ago

Thanks! I am not sure I can install beautifulsoup on our server machine but I will look into it.

user2629998 Over a year ago

@Andrew If you have access to a shell (for example via SSH) you can easily install it with pip: pip install beautifulsoup4. If you don't have pip, you can extract the beautifulsoup tarball in your script's directory.

mata Over a year ago

BeautifulSoup is not an XML parser, but it can use lxml for parsing XML. To do so, you need to pass the "xml" argument to the constructor. Otherwise the content is parsed using an HTML parser and not as XML!

user2629998 Over a year ago

@mata updated my answer - but personally I never had any issues with parsing XML with BS4's HTML parser.

mata Over a year ago

It's a question of what you want. The HTML parser happily parses invalid XML, which usually isn't what you want. The XML parser throws an exception. Also, depending on the parser used, it may wrap the XML in <html></html> tags...
