How to parse XML of nested tags in Python

Question

I have the following XML.

<component name="QUESTIONS">
    <topic name="Chair"> 
        <state>active</state> 
        <subtopic name="Wooden">
            <links> 
                <link videoDuration="" youtubeId="" type="article">
                    <label>Understanding Wooden Chair</label>
                    <url>http://abcd.xyz.com/1111?view=app</url>
                </link> 
                <link videoDuration="" youtubeId="" type="article">
                    <label>How To Assemble Wooden CHair</label>
                    <url>http://abcd.xyz.com/2222?view=app</url>
                </link> 
                <link videoDuration="11:35" youtubeId="Qasefrt09_2" type="video">
                    <label>Wooden Chair Tutorial</label>
                    <url>/</url>
                </link> 
                <link videoDuration="1:06" youtubeId="MSDVN235879" type="video">
                    <label>How To Access Wood</label>
                    <url>/</url>
                </link> 
            </links>
        </subtopic>
    </topic> 
    <topic name="Table"> 
        <state>active</state> 
        <subtopic name="">
            <links> 
                <link videoDuration="" youtubeId="" type="article">
                    <label>Understanding Tables</label>
                    <url>http://abcd.xyz.com/3333?view=app</url>
                </link> 
                <link videoDuration="" youtubeId="" type="article">
                    <label>Set-up Table</label>
                    <url>http://abcd.xyz.com/4444?view=app</url>
                </link> 
                <link videoDuration="" youtubeId="" type="article">
                    <label>How To Change table</label>
                    <url>http://abcd.xyz.com/5555?view=app</url>
                </link> 
            </links>
        </subtopic> 
    </topic> 
</component>

I am trying to parse this XML in Python and build a URL array that contains:

1. All the http URLs present in the XML.
2. For each link tag, if a youtubeId is present, the YouTube URL built from it.

I have the following code, but it does not give me the URLs and links.

from xml.etree import ElementTree

with open('faq.xml', 'rt') as f:
    tree = ElementTree.parse(f)

for node in tree.iter():
    print node.tag, node.attrib.get('url')

for node in tree.iter('outline'):
    name = node.attrib.get('link')
    url = node.attrib.get('url')
    if name and url:
        print '  %s :: %s' % (name, url)
    else:
        print name

How can I get all the URLs?

I developed the following code based on the answers below. The problem is that it prints and returns just one URL, not all of them.

from xml.etree import ElementTree

def fetch_faq_urls():
    url_list = []
    with open('faq.xml', 'rt') as f:
        tree = ElementTree.parse(f)

    for link in tree.iter('link'):
        youtube = link.get('youtubeId')
        if youtube:
            print "https://www.youtube.com/watch?v=" + youtube
            video_url = "https://www.youtube.com/watch?v=" + youtube
            url_list.append(video_url)
            # print "youtubeId", link.find('label').text, '???'
        else:
            print link.find('url').text
            article_url = link.find('url').text
            url_list.append(article_url)
            # print 'url', link.find('label').text, 
      return url_list

faqs = fetch_faq_urls()
print faqs

tdelaney · Accepted Answer · 2016-09-29 03:28:03Z

1

The information you want is under <link>, so just iterate over those elements. Use get() to read the youtubeId attribute and find() to get the child <url> element.

from xml.etree import ElementTree

with open('faq.xml', 'rt') as f:
    tree = ElementTree.parse(f)

for link in tree.iter('link'):
    youtube = link.get('youtubeId')
    if youtube:
        print "youtube", link.find('label').text, '???'
    else:
        print 'url', link.find('label').text, link.find('url').text
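
Building on the loop above, a minimal sketch of collecting the URLs into a list might look like the following. It is written for Python 3 (the snippets above use Python 2 print statements) and parses an abbreviated inline sample instead of faq.xml. The function name collect_urls and the constant YOUTUBE_PREFIX are assumptions for illustration; the watch-URL prefix is the one used in the question's updated code.

```python
from xml.etree import ElementTree

# Abbreviated sample modelled on the XML in the question.
SAMPLE = """<component name="QUESTIONS">
  <topic name="Chair">
    <subtopic name="Wooden">
      <links>
        <link videoDuration="" youtubeId="" type="article">
          <label>Understanding Wooden Chair</label>
          <url>http://abcd.xyz.com/1111?view=app</url>
        </link>
        <link videoDuration="11:35" youtubeId="Qasefrt09_2" type="video">
          <label>Wooden Chair Tutorial</label>
          <url>/</url>
        </link>
      </links>
    </subtopic>
  </topic>
</component>"""

# Assumed prefix, taken from the question's updated code.
YOUTUBE_PREFIX = "https://www.youtube.com/watch?v="


def collect_urls(root):
    """Return article URLs and YouTube URLs for every <link> under root."""
    urls = []
    for link in root.iter("link"):
        youtube_id = link.get("youtubeId")
        if youtube_id:
            # Non-empty youtubeId: build the video URL from it.
            urls.append(YOUTUBE_PREFIX + youtube_id)
        else:
            # Otherwise keep the child <url> text if it is an http URL.
            url = link.findtext("url")
            if url and url.startswith("http"):
                urls.append(url)
    return urls


root = ElementTree.fromstring(SAMPLE)
urls = collect_urls(root)
print(urls)
```

With a file on disk, ElementTree.parse('faq.xml').getroot() could be passed to the same function in place of the inline sample.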


4 Comments

born2Learn Over a year ago

Thanks a lot. I get the idea of what I need to do. Thank you, tdelaney.

born2Learn Over a year ago

I updated my question with the code I developed. I don't understand why it appends only one value to the array.

tdelaney Over a year ago

@in_learning_software - That's just a minor indentation problem. Notice that your return url_list is inside the for block, so it gets executed on the first pass of the loop. Simply dedent it one level so that it lines up with the for statement.

born2Learn Over a year ago

o o o .. Thanks a lot!
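
The indentation issue described in the comments can be shown with a small Python 3 sketch. The function names first_only and all_items are hypothetical and exist only to contrast the two placements of return.

```python
def first_only(items):
    collected = []
    for item in items:
        collected.append(item)
        return collected  # inside the for block: runs on the first pass


def all_items(items):
    collected = []
    for item in items:
        collected.append(item)
    return collected  # dedented: runs once the loop has finished


print(first_only(["a", "b", "c"]))
print(all_items(["a", "b", "c"]))
```

The first function returns a one-element list because the return executes during the first iteration; the second returns every item.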

ge7600 · Answer · 2016-09-29 05:39:14Z

0

Take a look at xmltodict.

>>> import json
>>> import xmltodict
>>> print(json.dumps(xmltodict.parse("""
...  <mydocument has="an attribute">
...    <and>
...      <many>elements</many>
...      <many>more elements</many>
...    </and>
...    <plus a="complex">
...      element as well
...    </plus>
...  </mydocument>
...  """), indent=4))
{
    "mydocument": {
        "@has": "an attribute", 
        "and": {
            "many": [
                "elements", 
                "more elements"
            ]
        }, 
        "plus": {
            "@a": "complex", 
            "#text": "element as well"
        }
    }
}
