Parsing text from XML node in Python

Question

I'm trying to extract URLs from a sitemap like this: https://www.bestbuy.com/sitemap_c_0.xml.gz

I've unzipped and saved the .xml.gz file as an .xml file. The structure looks like this:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xhtml="http://www.w3.org/1999/xhtml" xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9 http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd">
<url>
    <loc>https://www.bestbuy.com/</loc>
    <priority>0.0</priority>
</url>
<url>
    <loc>https://www.bestbuy.com/site/3d-printers/3d-printer-filament/pcmcat335400050008.c?id=pcmcat335400050008</loc>
    <priority>0.0</priority>
</url>
<url>
    <loc>https://www.bestbuy.com/site/3d-printers/3d-printing-accessories/pcmcat748300527647.c?id=pcmcat748300527647</loc>
    <priority>0.0</priority>
</url>

I'm attempting to use ElementTree to extract all of the URLs within the loc nodes throughout this file, but struggling to get it working right.

Per the documentation, I'm trying something like this:

import xml.etree.ElementTree as ET
tree = ET.parse('my_local_filepath')
root = tree.getroot()

value = root.findall(".//loc")

However, nothing gets loaded into value. My goal is to extract all of the URLs between the loc nodes and print it out into a new flat file. Where am I going wrong?

You are not taking namespaces into account. See docs.python.org/3/library/… — mzjn
– mzjn, Commented Oct 17, 2018 at 5:32

Daniel Haley · Accepted Answer · 2019-04-04 15:16:11Z

3

You were close in your attempt but like mzjn said in a comment, you didn't account for the default namespace (xmlns="http://www.sitemaps.org/schemas/sitemap/0.9").

Here's an example of how to account for the namespace:

import xml.etree.ElementTree as ET
tree = ET.parse('my_local_filepath')

ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

for elem in tree.findall(".//sm:loc", ns):
    print(elem.text)

output:

https://www.bestbuy.com/
https://www.bestbuy.com/site/3d-printers/3d-printer-filament/pcmcat335400050008.c?id=pcmcat335400050008
https://www.bestbuy.com/site/3d-printers/3d-printing-accessories/pcmcat748300527647.c?id=pcmcat748300527647

Note that I used the namespace prefix sm, but you could use any NCName.

See here for more information on parsing XML with namespaces in ElementTree.

edited Apr 4, 2019 at 15:16

answered Oct 17, 2018 at 22:53

Daniel Haley

53.1k7 gold badges75 silver badges97 bronze badges

Sign up to request clarification or add additional context in comments.

Comments

LeKhan9 · Accepted Answer · 2018-10-17 05:25:58Z

2

We can iterate through the URLs, toss them into a list and write them to a file as such:

from xml.etree import ElementTree as ET

tree = ET.parse('test.xml')
root = tree.getroot()

name_space = '{http://www.sitemaps.org/schemas/sitemap/0.9}'

urls = []
for child in root.iter():
    for block in child.findall('{}url'.format(name_space)):
        for url in block.findall('{}loc'.format(name_space)):
            urls.append('{}\n'.format(url.text))

with open('sample_urls.txt', 'w+') as f:
    f.writelines(urls)

note we need to append the name space from the open urlset definition to properly parse the xml

edited Oct 17, 2018 at 5:25

answered Oct 17, 2018 at 4:51

LeKhan9

1,3501 gold badge7 silver badges15 bronze badges

2 Comments

tsb8m Over a year ago

It doesn't work, my urls array is still empty. Not sure if there's a formatting issue with the actual XML file I'm trying to open? I'm scraping the .xml.gz file like the one I linked to and using GzipFile to unzip it.

LeKhan9 Over a year ago

Right, I think I chopped some important info off with my test file, and appending it in the parsing should help. I updated the answer.

Collectives™ on Stack Overflow

Parsing text from XML node in Python

2 Answers 2

Comments

2 Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

Comments

2 Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related