Parsing XML with namespace in Python via 'ElementTree'

Question

I have the following XML which I want to parse using Python's ElementTree:

<rdf:RDF xml:base="http://dbpedia.org/ontology/"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:owl="http://www.w3.org/2002/07/owl#"
    xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
    xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
    xmlns="http://dbpedia.org/ontology/">

    <owl:Class rdf:about="http://dbpedia.org/ontology/BasketballLeague">
        <rdfs:label xml:lang="en">basketball league</rdfs:label>
        <rdfs:comment xml:lang="en">
          a group of sports teams that compete against each other
          in Basketball
        </rdfs:comment>
    </owl:Class>

</rdf:RDF>

I want to find all owl:Class tags and then extract the value of all rdfs:label instances inside them. I am using the following code:

import xml.etree.ElementTree as ET
tree = ET.parse("filename")
root = tree.getroot()
root.findall('owl:Class')

Because of the namespace, I am getting the following error.

SyntaxError: prefix 'owl' not found in prefix map

I tried reading the document at http://effbot.org/zone/element-namespaces.htm but I am still not able to get this working since the above XML has multiple nested namespaces.

Kindly let me know how to change the code to find all the owl:Class tags.

Since Python 3.8, a namespace wildcard can be used with find(), findall() and findtext(). See stackoverflow.com/a/62117710/407651.
– mzjn, Commented Jul 19, 2021 at 19:50

Martijn Pieters · Accepted Answer · 2025-04-04 17:51:47Z

270

You need to give the .find(), findall() and iterfind() methods an explicit namespace dictionary:

namespaces = {'owl': 'http://www.w3.org/2002/07/owl#'} # add more as needed

root.findall('owl:Class', namespaces)

Prefixes are only looked up in the namespaces parameter you pass in. This means you can use any namespace prefix you like; the API splits off the owl: part, looks up the corresponding namespace URL in the namespaces dictionary, then changes the search to look for the XPath expression {http://www.w3.org/2002/07/owl#}Class instead. You can of course use the same syntax yourself:

root.findall('{http://www.w3.org/2002/07/owl#}Class')
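To cover the whole task from the question (find every owl:Class, then read the rdfs:label values inside it), here is a minimal sketch using only the standard library. The shortened inline document and the variable names are illustrative assumptions, not part of the original post.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the document from the question (illustrative).
XML = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:owl="http://www.w3.org/2002/07/owl#"
    xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
    xmlns="http://dbpedia.org/ontology/">
    <owl:Class rdf:about="http://dbpedia.org/ontology/BasketballLeague">
        <rdfs:label xml:lang="en">basketball league</rdfs:label>
    </owl:Class>
</rdf:RDF>"""

# The prefixes on the left are our own choice; only the URIs must match
# the namespaces declared in the document.
namespaces = {
    "owl": "http://www.w3.org/2002/07/owl#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
}

root = ET.fromstring(XML)

labels = []
for owl_class in root.findall("owl:Class", namespaces):
    # Search again inside each class, passing the same prefix map.
    for label in owl_class.findall("rdfs:label", namespaces):
        labels.append(label.text)

print(labels)
```

For a file on disk, ET.parse("filename").getroot() would take the place of ET.fromstring(XML).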

Also see the Parsing XML with Namespaces section of the ElementTree documentation.

As of Python 3.8, the ElementTree library also understands the {*} namespace wildcard, so root.findall('{*}Class') would also work (but don't do that if your document can have multiple namespaces that define the Class element).
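A small sketch of why the wildcard needs care, assuming Python 3.8 or later. The two example.com namespaces and the sample document are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical document in which two namespaces both define a Class element.
XML = """<root xmlns:a="http://example.com/a" xmlns:b="http://example.com/b">
    <a:Class>from a</a:Class>
    <b:Class>from b</b:Class>
</root>"""

root = ET.fromstring(XML)

# {*} matches the local name Class regardless of namespace (Python 3.8+).
any_ns = [el.text for el in root.findall("{*}Class")]

# The fully qualified name matches only the element in namespace "a".
only_a = [el.text for el in root.findall("{http://example.com/a}Class")]

print(any_ns)
print(only_a)
```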

If you can switch to the lxml library, things are better; that library supports the same ElementTree API, but collects namespaces for you in the .nsmap attribute on elements and generally has superior namespace support.

edited Apr 4 at 17:51

answered Feb 13, 2013 at 12:18

Martijn Pieters


13 Comments

Kostanos Over a year ago

Thank you. Any idea how I can get the namespace directly from the XML, without hard-coding it? Or how can I ignore it? I've tried findall('{*}Class') but it won't work in my case.

Martijn Pieters Over a year ago

You'd have to scan the tree for xmlns attributes yourself; as stated in the answer, lxml does this for you, the xml.etree.ElementTree module does not. But if you are trying to match a specific (already hardcoded) element, then you are also trying to match a specific element in a specific namespace. That namespace is not going to change between documents any more than the element name is. You may as well hardcode that with the element name.

Martijn Pieters Over a year ago

@Jon: register_namespace only influences serialisation, not search.

egpbos Over a year ago

Small addition that may be useful: when using cElementTree instead of ElementTree, findall will not take namespaces as a keyword argument, but rather simply as a normal argument, i.e. use ctree.findall('owl:Class', namespaces).

Wilson F Over a year ago

@Bludwarf: The docs do mention it (now, if not when you wrote that), but you have to read them verrrry carefully. See the Parsing XML with Namespaces section: there's an example contrasting the use of findall without and then with the namespace argument, but the argument is not mentioned as one of the arguments to the method in the Element object section.


Brad Dre · Accepted Answer · 2019-07-30 18:47:24Z

69

Here's how to do this with lxml without having to hard-code the namespaces or scan the text for them (as Martijn Pieters mentions):

from lxml import etree
tree = etree.parse("filename")
root = tree.getroot()
root.findall('owl:Class', root.nsmap)

UPDATE:

5 years later I'm still running into variations of this issue. lxml helps as I showed above, but not in every case. The commenters may have a valid point regarding this technique when it comes to merging documents, but I think most people are having difficulty simply searching documents.

Here's another case and how I handled it:

<?xml version="1.0" ?><Tag1 xmlns="http://www.mynamespace.com/prefix">
<Tag2>content</Tag2></Tag1>

xmlns without a prefix means that unprefixed tags get this default namespace. This means when you search for Tag2, you need to include the namespace to find it. However, lxml creates an nsmap entry with None as the key, and I couldn't find a way to search for it. So, I created a new namespace dictionary like this

namespaces = {}
# response uses a default namespace, and tags don't mention it
# create a new ns map using an identifier of our choice
for k, v in root.nsmap.items():  # .items() on Python 3; .iteritems() existed only on Python 2
    if not k:
        namespaces['myprefix'] = v
e = root.find('myprefix:Tag2', namespaces)
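The same idea can be written as a small helper that copies an nsmap-style mapping and swaps the None key for a usable prefix. This is only a sketch: searchable_nsmap and its default_prefix parameter are hypothetical names, and to stay runnable without lxml the example builds the mapping by hand and searches with the standard library parser. With lxml you would pass root.nsmap instead.

```python
import xml.etree.ElementTree as ET


def searchable_nsmap(nsmap, default_prefix="myprefix"):
    """Return a copy of nsmap in which a None or empty key gets a real prefix."""
    return {
        (prefix if prefix else default_prefix): uri
        for prefix, uri in nsmap.items()
    }


XML = (
    '<?xml version="1.0" ?><Tag1 xmlns="http://www.mynamespace.com/prefix">\n'
    "<Tag2>content</Tag2></Tag1>"
)

# Hand-written stand-in for what lxml reports as root.nsmap for this document.
nsmap = {None: "http://www.mynamespace.com/prefix"}

namespaces = searchable_nsmap(nsmap)
root = ET.fromstring(XML)
e = root.find("myprefix:Tag2", namespaces)
print(e.text)
```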

edited Jul 30, 2019 at 18:47

answered Nov 7, 2014 at 18:22

Brad Dre


4 Comments

Matti Virkkunen Over a year ago

The full namespace URL is the namespace identifier you're supposed to hard-code. The local prefix (owl) can change from file to file. Therefore doing what this answer suggests is a really bad idea.

Loïc Faure-Lacroix Over a year ago

@MattiVirkkunen Exactly. If the owl definition can change from file to file, shouldn't we use the definition given in each file instead of hardcoding it?

Matti Virkkunen Over a year ago

@LoïcFaure-Lacroix: Usually XML libraries will let you abstract that part out. You don't need to even know or care about the prefix used in the file itself, you just define your own prefix for the purpose of parsing or just use the full namespace name.

Eelco van Vliet Over a year ago

This answer helped me to at least be able to use the find function. No need to create your own prefix. I just did key = list(root.nsmap.keys())[0] and then added the key as prefix: root.find(f'{key}:Tag2', root.nsmap)

Davide Brunato · Accepted Answer · 2017-02-21 08:15:53Z

47

Note: This answer is useful for Python's standard library ElementTree when you do not want to hardcode namespaces.

To extract the namespace prefixes and URIs from XML data, you can use the ElementTree.iterparse function, handling only namespace start events (start-ns):

>>> from io import StringIO
>>> from xml.etree import ElementTree
>>> my_schema = u'''<rdf:RDF xml:base="http://dbpedia.org/ontology/"
...     xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
...     xmlns:owl="http://www.w3.org/2002/07/owl#"
...     xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
...     xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
...     xmlns="http://dbpedia.org/ontology/">
... 
...     <owl:Class rdf:about="http://dbpedia.org/ontology/BasketballLeague">
...         <rdfs:label xml:lang="en">basketball league</rdfs:label>
...         <rdfs:comment xml:lang="en">
...           a group of sports teams that compete against each other
...           in Basketball
...         </rdfs:comment>
...     </owl:Class>
... 
... </rdf:RDF>'''
>>> my_namespaces = dict([
...     node for _, node in ElementTree.iterparse(
...         StringIO(my_schema), events=['start-ns']
...     )
... ])
>>> from pprint import pprint
>>> pprint(my_namespaces)
{'': 'http://dbpedia.org/ontology/',
 'owl': 'http://www.w3.org/2002/07/owl#',
 'rdf': 'http://www.w3.org/1999/02/22-rdf-syntax-ns#',
 'rdfs': 'http://www.w3.org/2000/01/rdf-schema#',
 'xsd': 'http://www.w3.org/2001/XMLSchema#'}

Then the dictionary can be passed as argument to the search functions:

root.findall('owl:Class', my_namespaces)
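If you also need the parsed tree, the namespaces and the root element can be collected in one pass instead of parsing twice. The following is a sketch of that variation, not part of the recipe above; parse_with_namespaces is a hypothetical helper name and the sample document is shortened.

```python
import io
import xml.etree.ElementTree as ET


def parse_with_namespaces(source):
    """Parse source once, returning (root, {prefix: uri})."""
    namespaces = {}
    root = None
    for event, payload in ET.iterparse(source, events=("start", "start-ns")):
        if event == "start-ns":
            prefix, uri = payload      # payload is a (prefix, uri) tuple
            namespaces[prefix] = uri
        elif root is None:
            root = payload             # the first "start" event is the root element
    return root, namespaces


XML = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:owl="http://www.w3.org/2002/07/owl#"
    xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
    <owl:Class>
        <rdfs:label>basketball league</rdfs:label>
    </owl:Class>
</rdf:RDF>"""

root, my_namespaces = parse_with_namespaces(io.StringIO(XML))
classes = root.findall("owl:Class", my_namespaces)
print(len(classes))
```

As one of the comments notes, a filename can be passed in place of the StringIO object.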

edited Feb 21, 2017 at 8:15

answered May 24, 2016 at 9:09

Davide Brunato


7 Comments

delrocco Over a year ago

This is useful for those of us without access to lxml and without wanting to hardcode namespace.

Yuli Over a year ago

I got the error ValueError: write to closed file for this line: my_namespaces = dict([node for _, node in ET.iterparse(StringIO(my_schema), events=['start-ns'])]). Any idea what's wrong?

Davide Brunato Over a year ago

The error is probably related to the class io.StringIO, which refuses ASCII strings. I tested my recipe with Python 3. With the unicode string prefix 'u' added to the sample string, it also works with Python 2 (2.7).

Arminius Over a year ago

Instead of dict([...]) you can also use dict comprehension.

JustAC0der Over a year ago

Instead of StringIO(my_schema) you can also put the filename of the XML file.


Bram Vanroy · Accepted Answer · 2019-10-11 08:33:15Z

9

To get the namespace in its namespace format, e.g. {myNameSpace}, you can do the following:

import re
root = tree.getroot()
ns = re.match(r'{.*}', root.tag).group(0)  # the '{uri}' part of a '{uri}localname' tag

This way, you can use it later on in your code to find nodes, e.g. using string interpolation (Python 3).

link = root.find(f"{ns}link")
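A self-contained sketch of this approach, with a guard for documents whose root has no namespace (in which case re.match returns None and .group(0) would raise). The helper name namespace_of and the Atom-like sample document are assumptions made for illustration.

```python
import re
import xml.etree.ElementTree as ET


def namespace_of(element):
    """Return the '{uri}' prefix of element.tag, or '' if it has no namespace."""
    match = re.match(r"\{[^}]*\}", element.tag)
    return match.group(0) if match else ""


XML = '<feed xmlns="http://www.w3.org/2005/Atom"><link href="x"/></feed>'
root = ET.fromstring(XML)

ns = namespace_of(root)
link = root.find(f"{ns}link")
print(ns, link.get("href"))

# A root without a namespace gives an empty prefix, so the same lookup still works.
plain_root = ET.fromstring("<feed><link/></feed>")
plain_ns = namespace_of(plain_root)
```

Note that this only recovers the namespace of the root element itself; children that live in other namespaces still need their own URI.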

edited Oct 11, 2019 at 8:33

answered Oct 1, 2018 at 12:25

Bram Vanroy


Comments

MJM · Accepted Answer · 2016-08-16 09:51:36Z

7

I've been using similar code to this and have found it's always worth reading the documentation... as usual!

findall() will only find elements which are direct children of the current tag. So, not really ALL.

It might be worth your while trying to get your code working with the following, especially if you're dealing with big and complex XML files, so that sub-sub-elements (etc.) are also included. If you already know where the elements are in your XML, then I suppose it'll be fine! Just thought this was worth remembering.

root.iter()

ref: https://docs.python.org/3/library/xml.etree.elementtree.html#finding-interesting-elements "Element.findall() finds only elements with a tag which are direct children of the current element. Element.find() finds the first child with a particular tag, and Element.text accesses the element’s text content. Element.get() accesses the element’s attributes:"
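A short sketch of the difference, using a shortened version of the question's document. The variable names are illustrative.

```python
import xml.etree.ElementTree as ET

XML = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:owl="http://www.w3.org/2002/07/owl#"
    xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
    <owl:Class>
        <rdfs:label>basketball league</rdfs:label>
    </owl:Class>
</rdf:RDF>"""

RDFS = "http://www.w3.org/2000/01/rdf-schema#"
ns = {"rdfs": RDFS}
root = ET.fromstring(XML)

# A bare tag name matches direct children only; the label is a grandchild.
direct = root.findall("rdfs:label", ns)

# The .// prefix searches descendants at any depth.
nested = root.findall(".//rdfs:label", ns)

# iter() also walks the whole subtree, but takes the fully qualified tag.
via_iter = list(root.iter("{%s}label" % RDFS))

print(len(direct), len(nested), len(via_iter))
```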

answered Aug 16, 2016 at 9:51

MJM


1 Comment

mzjn Over a year ago

The ElementTree documentation is a bit unclear and easy to misunderstand, IMHO. It is possible to get all descendants. Instead of elem.findall("X"), use elem.findall(".//X").

Maarten Derickx · Accepted Answer · 2021-04-07 16:13:53Z

This is basically Davide Brunato's answer; however, I found that his answer had serious problems with the default namespace being the empty string, at least on my Python 3.6 installation. The function I distilled from his code, and that worked for me, is the following:

from io import StringIO
from xml.etree import ElementTree
def get_namespaces(xml_string):
    namespaces = dict([
            node for _, node in ElementTree.iterparse(
                StringIO(xml_string), events=['start-ns']
            )
    ])
    namespaces["ns0"] = namespaces[""]
    return namespaces

where ns0 is just a placeholder for the empty namespace and you can replace it by any random string you like.

If I then do:

my_namespaces = get_namespaces(my_schema)
root.findall('ns0:SomeTagWithDefaultNamespace', my_namespaces)

It also produces the correct answer for tags using the default namespace as well.
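The function above raises KeyError when the document declares no default namespace. A slightly more defensive sketch is shown here; it moves the empty-string key to a chosen prefix only if it exists. The default_prefix parameter and the sample catalog document are assumptions for illustration. The last lookup relies on the documented Python 3.8+ behaviour where an empty-string prefix in the mapping applies to unprefixed names in the path.

```python
from io import StringIO
import xml.etree.ElementTree as ET


def get_namespaces(xml_string, default_prefix="ns0"):
    """Collect {prefix: uri}, renaming the default namespace to default_prefix."""
    namespaces = {
        prefix: uri
        for _, (prefix, uri) in ET.iterparse(
            StringIO(xml_string), events=["start-ns"]
        )
    }
    default_uri = namespaces.pop("", None)   # no KeyError if it is absent
    if default_uri is not None:
        namespaces[default_prefix] = default_uri
    return namespaces


XML = '<catalog xmlns="http://example.com/catalog"><item>one</item></catalog>'

my_namespaces = get_namespaces(XML)
root = ET.fromstring(XML)
item = root.find("ns0:item", my_namespaces)

# Python 3.8+ only: an empty prefix covers unprefixed names in the path.
item_38 = root.find("item", {"": "http://example.com/catalog"})

print(item.text, item_38.text)
```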

peter.slizik · Accepted Answer · 2024-01-05 15:08:23Z

1

My solution is based on @Martijn Pieters' comment:

register_namespace only influences serialisation, not search.

So the trick here is to use different dictionaries for serialization and for searching.

namespaces = {
    '': 'http://www.example.com/default-schema',
    'spec': 'http://www.example.com/specialized-schema',
}

Now, register all namespaces for parsing and writing:

for name, value in namespaces.items():
    ET.register_namespace(name, value)

For searching (find(), findall(), iterfind()) we need a non-empty prefix. Pass these functions a modified dictionary (here I modify the original dictionary, but this must be done only after the namespaces are registered).

namespaces['default'] = namespaces['']

Now, the functions from the find() family can be used with the default prefix:

print(root.find('default:myelem', namespaces))

but

tree.write(destination)

does not use any prefixes for elements in the default namespace.
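Put together, the steps above could look like the following sketch. The element names and the inline document are invented for the example, and it keeps two separate dictionaries rather than modifying one in place. Note that ET.register_namespace changes a module-wide registry, so it affects every later serialisation in the same process.

```python
import xml.etree.ElementTree as ET

DEFAULT_URI = "http://www.example.com/default-schema"
SPEC_URI = "http://www.example.com/specialized-schema"

# Serialisation side: the empty prefix keeps default-namespace elements
# unprefixed when the tree is written out.
ET.register_namespace("", DEFAULT_URI)
ET.register_namespace("spec", SPEC_URI)

# Search side: a separate map in which the default namespace has a real prefix.
search_ns = {"default": DEFAULT_URI, "spec": SPEC_URI}

root = ET.fromstring(
    f'<root xmlns="{DEFAULT_URI}" xmlns:spec="{SPEC_URI}">'
    "<myelem>hello</myelem><spec:extra>x</spec:extra></root>"
)

found = root.find("default:myelem", search_ns)
serialised = ET.tostring(root, encoding="unicode")

print(found.text)
print(serialised)
```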

edited Jan 5, 2024 at 15:08

answered May 30, 2019 at 11:00

peter.slizik


3 Comments

Hermann12 Over a year ago

For python version 3.11 use namespaces.items() instead of namespaces.iteritems().

Frank Vel Over a year ago

@Hermann12 More generally it applies to any version 3.0 or higher.

peter.slizik Over a year ago

@Hermann12: Yes, I've updated my answer accordingly. Thank you for pointing it out. Yet, .find() and .iterfind() are ElementTree's methods and they behave differently.

Frank Vel · Accepted Answer · 2023-12-15 18:44:30Z

0

A slightly longer alternative is to create another class, ElementNS, which inherits from ET.Element and carries the namespaces, then create a factory for this class and pass it on to the parser:

import xml.etree.ElementTree as ET


def parse_namespaces(source):
    return dict(node for _e, node in ET.iterparse(source, events=['start-ns']))


def create_element_factory(namespaces):
    def element_factory(tag, attrib):
        el = ElementNS(tag, attrib)
        el.namespaces = namespaces
        return el
    return element_factory


class ElementNS(ET.Element):
    namespaces = None

    # Patch methods to include namespaces
    def find(self, path):
        return super().find(path, self.namespaces)

    def findtext(self, path, default=None):
        return super().findtext(path, default, self.namespaces)

    def findall(self, path):
        return super().findall(path, self.namespaces)

    def iterfind(self, path):
        return super().iterfind(path, self.namespaces)


def parse(source):
    # Set up parser with namespaced element factory
    namespaces = parse_namespaces(source)
    element_factory = create_element_factory(namespaces)
    tree_builder = ET.TreeBuilder(element_factory=element_factory)
    parser = ET.XMLParser(target=tree_builder)
    element_tree = ET.ElementTree()

    return element_tree.parse(source, parser=parser)

Then findall can be used without passing namespaces:

document = parse("filename")
document.findall("owl:Class")

edited Dec 15, 2023 at 18:44

answered Dec 15, 2023 at 18:38

Frank Vel


2 Comments

Hermann12 Over a year ago

Very complicated to reach the same result as described above.

Frank Vel Over a year ago

@Hermann12 if you've written e.find(..., namespaces) a couple dozen times, it makes sense to make a class for it, so you only have to write e.find(...). However note that this is likely slower, as it can't rely on the C implementation of ET.Element.
