I have the following code to parse an XML feed, but I can't get it to iterate through the children:

import urllib, urllib2, re, time, os
import xml.etree.ElementTree as ET 

def wgetUrl(target):
    try:
        req = urllib2.Request(target)
        req.add_header('User-Agent', 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.9.0.3) Gecko/2008092417 Firefox/3.0.3')
        response = urllib2.urlopen(req)
        outtxt = response.read()
        response.close()
    except:
        return ''
    return outtxt

newUrl = 'http://feeds.rasset.ie/rteavgen/player/playlist?showId=10056467'

data = wgetUrl(newUrl)
tree = ET.fromstring(data)
#tree = ET.parse(data)
for elem in tree.iter('entry'):
    print elem.tag, elem.attrib

Now, if I remove 'entry' from the iter() call, I get output like this (why the URL?):

{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}id {}
{http://www.w3.org/2005/Atom}published {}
{http://www.w3.org/2005/Atom}updated {}
{http://www.w3.org/2005/Atom}title {'type': 'text'}

But if I write the iter statement like this, it still does not find the children of entry:

for elem in tree.iter('{http://www.w3.org/2005/Atom}entry'):
    print elem.tag, elem.attrib

I still only get the entry element on its own, not the children:

{http://www.w3.org/2005/Atom}entry {}

Any idea what I am doing wrong?

I have searched everywhere but can't figure this out... I am new to all this so sorry if it is something stupid.

1 Answer

If you are parsing an Atom feed, you really want to use the feedparser library instead, which takes care of all these details and many more for you.

The {http://www.w3.org/2005/Atom} part is a namespace. You need to specify that namespace to select the entry tags:

for elem in tree.iterfind('ns:entry', {'ns': 'http://www.w3.org/2005/Atom'}):

where I used a dictionary to map the ns: prefix to the namespace, or you can use the same curly braces syntax:

for elem in tree.iterfind('{http://www.w3.org/2005/Atom}entry'):
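As a quick check that the two spellings select the same elements, here is a minimal sketch. It assumes Python 3 (so print is a function) and uses a small inline Atom document, SAMPLE_FEED, invented for illustration in place of the real feed.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the downloaded feed; only the namespace matters here.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry><id>a</id><title type="text">First</title></entry>
  <entry><id>b</id><title type="text">Second</title></entry>
</feed>"""

# Maps the arbitrary prefix 'ns' to the Atom namespace URI.
ATOM_NS = {'ns': 'http://www.w3.org/2005/Atom'}

root = ET.fromstring(SAMPLE_FEED)

# Prefix form: 'ns:' is resolved through the mapping passed as the second argument.
by_prefix = list(root.iterfind('ns:entry', ATOM_NS))

# Curly-brace form: the namespace URI is written out in full.
by_braces = list(root.iterfind('{http://www.w3.org/2005/Atom}entry'))

# An unqualified name matches nothing, because every tag in the feed is namespaced.
unqualified = list(root.iterfind('entry'))

print(len(by_prefix), len(by_braces), len(unqualified))
```

The prefix itself is arbitrary; only the URI it maps to has to match the one declared in the feed.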

Once you have the element, you still need to explicitly find its children:

for elem in tree.iterfind('{http://www.w3.org/2005/Atom}entry'):
    for child in elem:
        print child
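
To make the difference between selecting entry and walking its children concrete, here is a small self-contained sketch. It assumes Python 3 and a made-up feed snippet; SAMPLE_FEED, its values, and the helper name entry_children are illustrative and not taken from the real feed.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data shaped like the output shown in the question.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <id>example-id</id>
    <published>2013-01-20T12:00:00Z</published>
    <title type="text">Example show</title>
  </entry>
</feed>"""

ATOM = '{http://www.w3.org/2005/Atom}'


def entry_children(xml_text):
    """Return (local tag name, text, attributes) for each child of every entry."""
    root = ET.fromstring(xml_text)
    rows = []
    for entry in root.iterfind(ATOM + 'entry'):
        # Iterating an element yields its direct children only.
        for child in entry:
            tag = child.tag
            # Strip the namespace part so the output is easier to read.
            local_name = tag[len(ATOM):] if tag.startswith(ATOM) else tag
            rows.append((local_name, child.text, child.attrib))
    return rows


for name, text, attrib in entry_children(SAMPLE_FEED):
    print(name, text, attrib)
```

Looping over an element directly visits only its immediate children; calling iter() on the entry element with no tag argument would instead walk the entry itself and all of its descendants.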

Comments

Even if I use for elem in tree.iterfind('{http://www.w3.org/2005/Atom}entry'): print elem.tag, elem.attrib it still doesn't iterate down to the children (e.g. <id>, <published>, <updated>, <title>). Any idea why?
@user1995132: Yes, you are searching for entry only, it won't find the children then. You are asking for entry tags, not id or published or updated or title tags.
Even with tree.iter('{http://www.w3.org/2005/Atom}entry') it didn't work, so when I saw your example I tried iterfind, but I got the same result.
@user1995132: Just tested against that feed, I find the one element with iterfind() just fine.
@user1995132: Did I mention that using feedparser would be much easier already?