0

I want to parse the following XML file using Python. My "folder" variable always equals the 8-digit number towards the end of the <link> tag. In this case it is 11199709.

Python

for folder in folderList:

I want to be able to say: when "folder" equals the last 8 digits in the link tag, give me the eq:seconds value. I tried adapting the ElementTree code from the Python docs, but I am having trouble because there are so many levels of nesting. root[0][1].text will not retrieve values under the item tag.

XML

<rss xmlns:georss="http://www.georss.org/georss/" xmlns:eq="http://earthquake.usgs.gov/rss/1.0/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" version="2.0">
  <channel>
      <title>USGS Earthquake ShakeMaps</title>
      <description>List of ShakeMaps for events in the last 30 days</description>
      <link>http://earthquake.usgs.gov/</link>
      <dc:publisher>U.S. Geological Survey</dc:publisher>
      <pubDate>Thu, 27 Mar 2014 15:33:05 +0000</pubDate>
      <item>
         <title>4.11 - 79.3 miles NNW of Kotzebue</title>
         <description>
         <![CDATA[<img src="http://earthquake.usgs.gov/eqcenter/shakemap/thumbs/shakemap_ak_11199709.jpg" width="100" align="left" hspace="10"/><p>Date: Thu, 27 Mar 2014 07:28:31 UTC<br/>Lat/Lon: 67.9858/-163.494<br/>Depth: 15.9122</p>]]></description>
         <link>http://earthquake.usgs.gov/eqcenter/shakemap/ak/shake/11199709/</link>
         <pubDate>Thu, 27 Mar 2014 07:53:33 +0000</pubDate>
         <geo:lat>67.9858</geo:lat>
         <geo:long>-163.494</geo:long>
         <dc:subject>4</dc:subject>
         <eq:seconds>1395905311</eq:seconds>
         <eq:depth>15.9122</eq:depth>
         <eq:region>ak</eq:region>
         </item>
       <item>
              ...similar to above item

2 Answers

1

If you are worried about speed, I recommend lxml. It has extra dependencies but is usually much faster than BeautifulSoup.
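
For illustration, here is a minimal sketch of that lookup with lxml.etree, falling back to the standard library's xml.etree.ElementTree (which provides the same calls used here) if lxml is not installed. SAMPLE is a trimmed stand-in for the feed in the question, folderList holds a made-up value, and treating the last path segment of <link> as the folder name is an assumption.

```python
# Sketch only: use lxml.etree if installed, otherwise the standard library's
# ElementTree, which offers the same calls used below.
try:
    from lxml import etree
except ImportError:
    import xml.etree.ElementTree as etree

# Namespace prefix -> URI, copied from the <rss> element of the feed.
NS = {"eq": "http://earthquake.usgs.gov/rss/1.0/"}

# Trimmed stand-in for the feed in the question (assumption: a single item).
SAMPLE = b"""<rss xmlns:eq="http://earthquake.usgs.gov/rss/1.0/" version="2.0">
  <channel>
    <title>USGS Earthquake ShakeMaps</title>
    <item>
      <link>http://earthquake.usgs.gov/eqcenter/shakemap/ak/shake/11199709/</link>
      <eq:seconds>1395905311</eq:seconds>
    </item>
  </channel>
</rss>"""


def seconds_by_folder(root):
    """Map the last path segment of each item's <link> to its <eq:seconds> text."""
    result = {}
    for item in root.iter("item"):  # finds <item> at any depth, no indexing
        link = item.findtext("link", default="")
        folder = link.rstrip("/").rsplit("/", 1)[-1]
        result[folder] = item.findtext("eq:seconds", namespaces=NS)
    return result


root = etree.fromstring(SAMPLE)
lookup = seconds_by_folder(root)

folderList = ["11199709"]  # hypothetical value; folder names must be strings
for folder in folderList:
    print(folder, lookup.get(folder))
```

Using iter("item") avoids positional indexing such as root[0][1], and the namespaces mapping is what lets the eq: prefix resolve.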

0

Use BeautifulSoup, which can parse HTML and XML (the latter with an external module) and is much easier to use than the XML parser included with Python.

This code should do what you want:

from bs4 import BeautifulSoup

# Load your XML file. The "xml" parser (which requires lxml) is used because
# the HTML parsers treat <link> as an empty tag, so item.link.text would be
# empty; see comments below.
with open("filename.xml") as f:
    xml = BeautifulSoup(f, "xml")
# you can also load it from a URL by using "urllib" or "Python-Requests"

for folder in folderList: # each folder must be a string, e.g. "11199709"
    for item in xml.find_all("item"): # iterate through all <item> elements
        if folder in item.link.text: # if folder's name is in the <link> element
            print(item.find("eq:seconds").text) # print the <eq:seconds> element

6 Comments

Thanks! I am not sure I can install beautifulsoup on our server machine but I will look into it.
@Andrew If you have access to a shell (for example via SSH) you can easily install it with pip: pip install beautifulsoup4, and if you don't have pip you can extract the beautifulsoup tarball in your script's directory.
BeautifulSoup is not an XML parser, but it can use lxml for parsing XML. To do so you need to pass the "xml" argument to the constructor. Otherwise the content is parsed using an HTML parser and not as XML!
@mata updated my answer - but personally I never had any issues with parsing XML with BS4's HTML parser.
It's a question of what you want. The HTML parser happily parses invalid XML, which usually isn't what you want. The XML parser throws an exception. Also, depending on the parser used, it may wrap the XML in <html></html> tags...