I have the following XML:

<component name="QUESTIONS">
    <topic name="Chair"> 
        <state>active</state> 
        <subtopic name="Wooden">
            <links> 
                <link videoDuration="" youtubeId="" type="article">
                    <label>Understanding Wooden Chair</label>
                    <url>http://abcd.xyz.com/1111?view=app</url>
                </link> 
                <link videoDuration="" youtubeId="" type="article">
                    <label>How To Assemble Wooden CHair</label>
                    <url>http://abcd.xyz.com/2222?view=app</url>
                </link> 
                <link videoDuration="11:35" youtubeId="Qasefrt09_2" type="video">
                    <label>Wooden Chair Tutorial</label>
                    <url>/</url>
                </link> 
                <link videoDuration="1:06" youtubeId="MSDVN235879" type="video">
                    <label>How To Access Wood</label>
                    <url>/</url>
                </link> 
            </links>
        </subtopic>
    </topic> 
    <topic name="Table"> 
        <state>active</state> 
        <subtopic name="">
            <links> 
                <link videoDuration="" youtubeId="" type="article">
                    <label>Understanding Tables</label>
                    <url>http://abcd.xyz.com/3333?view=app</url>
                </link> 
                <link videoDuration="" youtubeId="" type="article">
                    <label>Set-up Table</label>
                    <url>http://abcd.xyz.com/4444?view=app</url>
                </link> 
                <link videoDuration="" youtubeId="" type="article">
                    <label>How To Change table</label>
                    <url>http://abcd.xyz.com/5555?view=app</url>
                </link> 
            </links>
        </subtopic> 
    </topic> 
</component>

I am trying to parse this XML in Python and build a list of URLs that contains:

1. All the http URLs present in the XML.
2. For each link tag that has a youtubeId, a YouTube URL built from that ID.

I have the following code, but it does not give me the URLs and links.

from xml.etree import ElementTree

with open('faq.xml', 'rt') as f:
    tree = ElementTree.parse(f)

for node in tree.iter():
    print node.tag, node.attrib.get('url')

for node in tree.iter('outline'):
    name = node.attrib.get('link')
    url = node.attrib.get('url')
    if name and url:
        print '  %s :: %s' % (name, url)
    else:
        print name

How can I get all the URLs?

I developed the following code based on the answers below. The problem is that it prints just one URL, not all of them.

from xml.etree import ElementTree

def fetch_faq_urls():
    url_list = []
    with open('faq.xml', 'rt') as f:
        tree = ElementTree.parse(f)

    for link in tree.iter('link'):
        youtube = link.get('youtubeId')
        if youtube:
            print "https://www.youtube.com/watch?v=" + youtube
            video_url = "https://www.youtube.com/watch?v=" + youtube
            url_list.append(video_url)
            # print "youtubeId", link.find('label').text, '???'
        else:
            print link.find('url').text
            article_url = link.find('url').text
            url_list.append(article_url)
            # print 'url', link.find('label').text, 
      return url_list

faqs = fetch_faq_urls()
print faqs

2 Answers

1

The information you want is under <link> so just iterate through those. Use get() to get the youtube id and find() to get the child <url> object.

from xml.etree import ElementTree

with open('faq.xml', 'rt') as f:
    tree = ElementTree.parse(f)

for link in tree.iter('link'):
    youtube = link.get('youtubeId')
    if youtube:
        print "youtube", link.find('label').text, '???'
    else:
        print 'url', link.find('label').text, link.find('url').text
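The loop above prints each link; the question asks for a list of URLs instead. A minimal sketch of that, under a few assumptions: it is written for Python 3 (the thread's snippets are Python 2), it parses an abbreviated inline copy of the question's XML rather than faq.xml, the YouTube prefix is the one used in the question's own code, and the names `SAMPLE_XML`, `YOUTUBE_WATCH_PREFIX` and `collect_urls` are made up for illustration.

```python
from xml.etree import ElementTree

# Abbreviated copy of the question's XML, inlined so the sketch runs without faq.xml.
SAMPLE_XML = """
<component name="QUESTIONS">
    <topic name="Chair">
        <subtopic name="Wooden">
            <links>
                <link videoDuration="" youtubeId="" type="article">
                    <label>Understanding Wooden Chair</label>
                    <url>http://abcd.xyz.com/1111?view=app</url>
                </link>
                <link videoDuration="11:35" youtubeId="Qasefrt09_2" type="video">
                    <label>Wooden Chair Tutorial</label>
                    <url>/</url>
                </link>
            </links>
        </subtopic>
    </topic>
</component>
"""

# Assumption: same watch-URL prefix as in the question's code.
YOUTUBE_WATCH_PREFIX = "https://www.youtube.com/watch?v="


def collect_urls(xml_text):
    """Return article URLs and YouTube watch URLs in document order."""
    root = ElementTree.fromstring(xml_text)
    urls = []
    for link in root.iter('link'):
        # youtubeId is an attribute of <link>; it is empty for articles.
        youtube_id = link.get('youtubeId')
        if youtube_id:
            urls.append(YOUTUBE_WATCH_PREFIX + youtube_id)
        else:
            # findtext returns None when there is no <url> child.
            url = link.findtext('url')
            if url and url.startswith('http'):
                urls.append(url)
    return urls


print(collect_urls(SAMPLE_XML))
```

To read from a file instead, `ElementTree.parse('faq.xml').getroot()` could replace the `fromstring` call; the loop stays the same.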

4 Comments

Thanks a lot. I got the idea of what I need to do. Thank you, tdelaney.
I updated my question with the code I developed. Why is it pushing only 1 value to the array?
@in_learning_software - That's just a minor indentation problem. Notice that your return url_list is in the for block so it gets executed on the first pass of the loop. Simply dedent to the next higher level.
o o o .. Thanks a lot!
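The effect of the `return` placement described in the comments can be seen with a toy example. This is only an illustrative sketch in Python 3; the function names `first_pass_only` and `all_passes` are hypothetical and not taken from the question.

```python
def first_pass_only(items):
    collected = []
    for item in items:
        collected.append(item)
        # return is indented inside the for block, so the function
        # exits during the first pass of the loop.
        return collected
    return collected  # only reached when items is empty


def all_passes(items):
    collected = []
    for item in items:
        collected.append(item)
    # return is dedented to the function level, so it runs once
    # the loop has finished.
    return collected


print(first_pass_only(['a', 'b', 'c']))
print(all_passes(['a', 'b', 'c']))
```

The first call returns a one-element list and the second returns all three items, which matches the symptom in the question: only one URL ends up in the result.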
0

Take a look at xmltodict.

>>> import json
>>> import xmltodict
>>> print(json.dumps(xmltodict.parse("""
...  <mydocument has="an attribute">
...    <and>
...      <many>elements</many>
...      <many>more elements</many>
...    </and>
...    <plus a="complex">
...      element as well
...    </plus>
...  </mydocument>
...  """), indent=4))
{
    "mydocument": {
        "@has": "an attribute", 
        "and": {
            "many": [
                "elements", 
                "more elements"
            ]
        }, 
        "plus": {
            "@a": "complex", 
            "#text": "element as well"
        }
    }
}
