204

I have the following XML which I want to parse using Python's ElementTree:

<rdf:RDF xml:base="http://dbpedia.org/ontology/"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:owl="http://www.w3.org/2002/07/owl#"
    xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
    xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
    xmlns="http://dbpedia.org/ontology/">

    <owl:Class rdf:about="http://dbpedia.org/ontology/BasketballLeague">
        <rdfs:label xml:lang="en">basketball league</rdfs:label>
        <rdfs:comment xml:lang="en">
          a group of sports teams that compete against each other
          in Basketball
        </rdfs:comment>
    </owl:Class>

</rdf:RDF>

I want to find all owl:Class tags and then extract the value of all rdfs:label instances inside them. I am using the following code:

tree = ET.parse("filename")
root = tree.getroot()
root.findall('owl:Class')

Because of the namespace, I am getting the following error.

SyntaxError: prefix 'owl' not found in prefix map

I tried reading the document at http://effbot.org/zone/element-namespaces.htm but I am still not able to get this working since the above XML has multiple nested namespaces.

Kindly let me know how to change the code to find all the owl:Class tags.

Since Python 3.8, a namespace wildcard can be used with find(), findall() and findtext(). See stackoverflow.com/a/62117710/407651. Commented Jul 19, 2021 at 19:50

8 Answers

270

You need to give the .find(), .findall() and .iterfind() methods an explicit namespace dictionary:

namespaces = {'owl': 'http://www.w3.org/2002/07/owl#'} # add more as needed

root.findall('owl:Class', namespaces)

Prefixes are only looked up in the namespaces parameter you pass in. This means you can use any namespace prefix you like; the API splits off the owl: part, looks up the corresponding namespace URI in the namespaces dictionary, then changes the search to look for the expression {http://www.w3.org/2002/07/owl#}Class instead. You can use the same syntax yourself too, of course:

root.findall('{http://www.w3.org/2002/07/owl#}Class')

Also see the Parsing XML with Namespaces section of the ElementTree documentation.
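
To tie this back to the question, here is a minimal sketch that finds every owl:Class element and collects the text of the rdfs:label elements inside it. It parses a trimmed inline copy of the question's XML with ET.fromstring instead of reading a file, and the names XML, NAMESPACES and labels are only illustrative:

```python
import xml.etree.ElementTree as ET

# Trimmed inline copy of the XML from the question (assumption: you would
# normally load this from a file with ET.parse instead).
XML = """<rdf:RDF xml:base="http://dbpedia.org/ontology/"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:owl="http://www.w3.org/2002/07/owl#"
    xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
    xmlns="http://dbpedia.org/ontology/">
    <owl:Class rdf:about="http://dbpedia.org/ontology/BasketballLeague">
        <rdfs:label xml:lang="en">basketball league</rdfs:label>
    </owl:Class>
</rdf:RDF>"""

# The prefixes on the left are our own choice; only the URIs must match.
NAMESPACES = {
    "owl": "http://www.w3.org/2002/07/owl#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
}

root = ET.fromstring(XML)

# Collect the text of every rdfs:label directly inside every owl:Class.
labels = [
    label.text
    for owl_class in root.findall("owl:Class", NAMESPACES)
    for label in owl_class.findall("rdfs:label", NAMESPACES)
]
print(labels)
```

The same namespaces dictionary has to be passed to each nested search; it is not remembered by the elements.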

As of Python 3.8, the ElementTree library also understands the {*} namespace wildcard, so root.findall('{*}Class') would also work (but don't do that if your document can have multiple namespaces that define the Class element).
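
To illustrate that caveat, here is a small sketch (it needs Python 3.8 or newer, and the document and its example.com URIs are made up for the demonstration) where two namespaces both define a Class element:

```python
import xml.etree.ElementTree as ET

# Two different namespaces that both define a Class element (made-up URIs).
XML = """<root xmlns:a="http://example.com/a" xmlns:b="http://example.com/b">
    <a:Class>first</a:Class>
    <b:Class>second</b:Class>
</root>"""

root = ET.fromstring(XML)

# Wildcard search (Python 3.8+): matches Class in any namespace.
any_ns = [el.text for el in root.findall("{*}Class")]

# Fully qualified search: matches only the namespace that was asked for.
only_a = [el.text for el in root.findall("{http://example.com/a}Class")]

print(any_ns)
print(only_a)
```

The wildcard search returns both elements, while the qualified search returns only the one from the first namespace.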

If you can switch to the lxml library, things are better; that library supports the same ElementTree API, but collects namespaces for you in the .nsmap attribute on elements and generally has superior namespace support.


13 Comments

Thank you. Any idea how I can get the namespace directly from the XML, without hard-coding it? Or how I can ignore it? I've tried findall('{*}Class') but it won't work in my case.
You'd have to scan the tree for xmlns attributes yourself; as stated in the answer, lxml does this for you, the xml.etree.ElementTree module does not. But if you are trying to match a specific (already hardcoded) element, then you are also trying to match a specific element in a specific namespace. That namespace is not going to change between documents any more than the element name is. You may as well hardcode that with the element name.
@Jon: register_namespace only influences serialisation, not search.
Small addition that may be useful: when using cElementTree instead of ElementTree, findall will not take namespaces as a keyword argument, but rather simply as a normal argument, i.e. use ctree.findall('owl:Class', namespaces).
@Bludwarf: The docs do mention it (now, if not when you wrote that), but you have to read them verrrry carefully. See the Parsing XML with Namespaces section: there's an example contrasting the use of findall without and then with the namespace argument, but the argument is not mentioned as one of the arguments to the method in the Element object section.
69

Here's how to do this with lxml without having to hard-code the namespaces or scan the text for them (as Martijn Pieters mentions):

from lxml import etree
tree = etree.parse("filename")
root = tree.getroot()
root.findall('owl:Class', root.nsmap)

UPDATE:

5 years later I'm still running into variations of this issue. lxml helps as I showed above, but not in every case. The commenters may have a valid point regarding this technique when it comes to merging documents, but I think most people are having difficulty simply searching documents.

Here's another case and how I handled it:

<?xml version="1.0" ?><Tag1 xmlns="http://www.mynamespace.com/prefix">
<Tag2>content</Tag2></Tag1>

xmlns without a prefix means that unprefixed tags get this default namespace. This means that when you search for Tag2, you need to include the namespace to find it. However, lxml creates an nsmap entry with None as the key, and I couldn't find a way to search for it. So, I created a new namespace dictionary like this:

namespaces = {}
# response uses a default namespace, and tags don't mention it
# create a new ns map using an identifier of our choice
for k, v in root.nsmap.items():  # .items() on Python 3 (.iteritems() was Python 2 only)
    if not k:
        namespaces['myprefix'] = v
e = root.find('myprefix:Tag2', namespaces)

4 Comments

The full namespace URL is the namespace identifier you're supposed to hard-code. The local prefix (owl) can change from file to file. Therefore doing what this answer suggests is a really bad idea.
@MattiVirkkunen Exactly: if the owl definition can change from file to file, shouldn't we use the definition given in each file instead of hardcoding it?
@LoïcFaure-Lacroix: Usually XML libraries will let you abstract that part out. You don't need to even know or care about the prefix used in the file itself, you just define your own prefix for the purpose of parsing or just use the full namespace name.
This answer helped me to at least be able to use the find function. No need to create your own prefix. I just did key = list(root.nsmap.keys())[0] and then added the key as prefix: root.find(f'{key}:Tag2', root.nsmap)
47

Note: This is an answer useful for Python's ElementTree standard library without using hardcoded namespaces.

To extract the namespace prefixes and URIs from XML data, you can use the ElementTree.iterparse function, handling only namespace start events (start-ns):

>>> from io import StringIO
>>> from xml.etree import ElementTree
>>> my_schema = u'''<rdf:RDF xml:base="http://dbpedia.org/ontology/"
...     xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
...     xmlns:owl="http://www.w3.org/2002/07/owl#"
...     xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
...     xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
...     xmlns="http://dbpedia.org/ontology/">
... 
...     <owl:Class rdf:about="http://dbpedia.org/ontology/BasketballLeague">
...         <rdfs:label xml:lang="en">basketball league</rdfs:label>
...         <rdfs:comment xml:lang="en">
...           a group of sports teams that compete against each other
...           in Basketball
...         </rdfs:comment>
...     </owl:Class>
... 
... </rdf:RDF>'''
>>> my_namespaces = dict([
...     node for _, node in ElementTree.iterparse(
...         StringIO(my_schema), events=['start-ns']
...     )
... ])
>>> from pprint import pprint
>>> pprint(my_namespaces)
{'': 'http://dbpedia.org/ontology/',
 'owl': 'http://www.w3.org/2002/07/owl#',
 'rdf': 'http://www.w3.org/1999/02/22-rdf-syntax-ns#',
 'rdfs': 'http://www.w3.org/2000/01/rdf-schema#',
 'xsd': 'http://www.w3.org/2001/XMLSchema#'}

Then the dictionary can be passed as argument to the search functions:

root.findall('owl:Class', my_namespaces)
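
If you also need the parsed tree, the same iterparse call can produce both in one pass, which avoids reading the input twice. A rough sketch follows; parse_with_namespaces is a made-up helper name, not part of ElementTree. Note that if a prefix is redeclared deeper in the document, this version keeps the first URI it saw:

```python
from io import StringIO
import xml.etree.ElementTree as ET


def parse_with_namespaces(source):
    """Return (root, namespaces) for a filename or file-like object.

    Hypothetical helper: collects prefix -> URI pairs from 'start-ns'
    events and remembers the first element seen, which is the root.
    """
    namespaces = {}
    root = None
    for event, node in ET.iterparse(source, events=("start", "start-ns")):
        if event == "start-ns":
            prefix, uri = node
            # Keep the first declaration if a prefix is declared twice.
            namespaces.setdefault(prefix, uri)
        elif root is None:
            root = node
    return root, namespaces


sample = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:owl="http://www.w3.org/2002/07/owl#"
    xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
    <owl:Class rdf:about="http://dbpedia.org/ontology/BasketballLeague">
        <rdfs:label xml:lang="en">basketball league</rdfs:label>
    </owl:Class>
</rdf:RDF>"""

root, my_namespaces = parse_with_namespaces(StringIO(sample))
classes = root.findall("owl:Class", my_namespaces)
```

By the time the loop finishes, the whole document has been parsed, so root is the complete tree.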

7 Comments

This is useful for those of us without access to lxml and without wanting to hardcode namespace.
I got the error ValueError: write to closed file for this line: my_namespaces = dict([node for _, node in ET.iterparse(StringIO(my_schema), events=['start-ns'])]). Any idea what's wrong?
The error is probably related to the class io.StringIO, which refuses ASCII strings. I had tested my recipe with Python 3. After adding the unicode string prefix 'u' to the sample string, it also works with Python 2 (2.7).
Instead of dict([...]) you can also use dict comprehension.
Instead of StringIO(my_schema) you can also put the filename of the XML file.
9

To get the namespace in its namespace format, e.g. {myNameSpace}, you can do the following:

import re
root = tree.getroot()
# root.tag looks like '{namespace-uri}localname'; keep only the '{...}' part
ns = re.match(r'\{.*\}', root.tag).group(0)

This way, you can use it later on in your code to find nodes, e.g. using an f-string (Python 3.6+).

link = root.find(f"{ns}link")
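
One caveat with this approach: if the root element has no namespace, re.match returns None and .group(0) raises AttributeError. A slightly more defensive sketch, where namespace_of is a hypothetical helper name and the sample document is the small default-namespace example from an earlier answer:

```python
import re
import xml.etree.ElementTree as ET


def namespace_of(element):
    """Return the '{uri}' part of element.tag, or '' if there is none."""
    match = re.match(r"\{.*?\}", element.tag)
    return match.group(0) if match else ""


root = ET.fromstring(
    '<Tag1 xmlns="http://www.mynamespace.com/prefix"><Tag2>content</Tag2></Tag1>'
)

ns = namespace_of(root)
tag2 = root.find(f"{ns}Tag2")

# An element without any namespace simply yields an empty string.
plain_ns = namespace_of(ET.fromstring("<Tag1><Tag2/></Tag1>"))
```

This only reuses the root element's namespace, so it helps when the elements you are searching for share that namespace.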


7

I've been using similar code to this and have found it's always worth reading the documentation... as usual!

findall() will only find elements which are direct children of the current tag. So, not really ALL.

It might be worth your while trying to get your code working with the following, especially if you're dealing with big and complex XML files, so that sub-sub-elements (etc.) are also included. If you know where the elements are in your XML, then I suppose it'll be fine! Just thought this was worth remembering.

root.iter()

ref: https://docs.python.org/3/library/xml.etree.elementtree.html#finding-interesting-elements "Element.findall() finds only elements with a tag which are direct children of the current element. Element.find() finds the first child with a particular tag, and Element.text accesses the element’s text content. Element.get() accesses the element’s attributes:"

1 Comment

The ElementTree documentation is a bit unclear and easy to misunderstand, IMHO. It is possible to get all descendants. Instead of elem.findall("X"), use elem.findall(".//X").
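
A short sketch of the difference, using a trimmed copy of the question's XML (the variable names are only illustrative). Note that iter() takes no namespaces argument, so the tag has to be written in {uri}name form there:

```python
import xml.etree.ElementTree as ET

NAMESPACES = {
    "owl": "http://www.w3.org/2002/07/owl#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
}

XML = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:owl="http://www.w3.org/2002/07/owl#"
    xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
    <owl:Class rdf:about="http://dbpedia.org/ontology/BasketballLeague">
        <rdfs:label xml:lang="en">basketball league</rdfs:label>
    </owl:Class>
</rdf:RDF>"""

root = ET.fromstring(XML)

# rdfs:label is a grandchild of the root, so a plain search finds nothing.
direct = root.findall("rdfs:label", NAMESPACES)

# './/' searches all descendants of the current element.
nested = root.findall(".//rdfs:label", NAMESPACES)

# iter() also walks all descendants, but needs the fully qualified tag.
via_iter = list(root.iter("{http://www.w3.org/2000/01/rdf-schema#}label"))
```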
3

This is basically Davide Brunato's answer; however, I found that his answer had serious problems with the default namespace being the empty string, at least on my Python 3.6 installation. The function I distilled from his code, and that worked for me, is the following:

from io import StringIO
from xml.etree import ElementTree
def get_namespaces(xml_string):
    namespaces = dict([
            node for _, node in ElementTree.iterparse(
                StringIO(xml_string), events=['start-ns']
            )
    ])
    namespaces["ns0"] = namespaces[""]
    return namespaces

where ns0 is just a placeholder for the empty namespace and you can replace it by any random string you like.

If I then do:

my_namespaces = get_namespaces(my_schema)
root.findall('ns0:SomeTagWithDefaultNamespace', my_namespaces)

It also produces the correct answer for tags using the default namespace as well.
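
One thing to watch for: namespaces[""] raises KeyError when the document declares no default namespace. A slightly more defensive variant could look like the sketch below; the default_prefix parameter is my own addition, and this version removes the empty-string key so that the mapping contains only named prefixes:

```python
from io import StringIO
from xml.etree import ElementTree


def get_namespaces(xml_string, default_prefix="ns0"):
    """Collect prefix -> URI pairs, renaming the default namespace.

    Sketch only: default_prefix is an arbitrary placeholder name.
    """
    namespaces = dict(
        node
        for _, node in ElementTree.iterparse(
            StringIO(xml_string), events=["start-ns"]
        )
    )
    # Move the default namespace (empty prefix), if any, to a named prefix.
    default_uri = namespaces.pop("", None)
    if default_uri is not None:
        namespaces[default_prefix] = default_uri
    return namespaces


with_default = (
    '<Tag1 xmlns="http://www.mynamespace.com/prefix"><Tag2>content</Tag2></Tag1>'
)
without_default = '<x:Tag1 xmlns:x="http://www.mynamespace.com/prefix"/>'

ns_a = get_namespaces(with_default)
ns_b = get_namespaces(without_default)

root = ElementTree.fromstring(with_default)
tag2 = root.find("ns0:Tag2", ns_a)
```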


1

My solution is based on @Martijn Pieters' comment:

register_namespace only influences serialisation, not search.

So the trick here is to use different dictionaries for serialization and for searching.

namespaces = {
    '': 'http://www.example.com/default-schema',
    'spec': 'http://www.example.com/specialized-schema',
}

Now, register all namespaces for parsing and writing:

for name, value in namespaces.items():
    ET.register_namespace(name, value)

For searching (find(), findall(), iterfind()) we need a non-empty prefix. Pass these functions a modified dictionary (here I modify the original dictionary, but this must be done only after the namespaces are registered).

namespaces['default'] = namespaces['']

Now, the functions from the find() family can be used with the default prefix:

print(root.find('default:myelem', namespaces))

but

tree.write(destination)

does not use any prefixes for elements in the default namespace.
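
Put together, the idea can be sketched as follows. This version keeps a separate dictionary for searching instead of modifying the original one, uses ET.tostring in place of tree.write so the result can be inspected, and the sample document is made up. Keep in mind that ET.register_namespace changes a module-wide registry:

```python
import xml.etree.ElementTree as ET

DEFAULT_URI = "http://www.example.com/default-schema"
SPEC_URI = "http://www.example.com/specialized-schema"

# Serialisation side: an empty prefix means "write as the default namespace".
ET.register_namespace("", DEFAULT_URI)
ET.register_namespace("spec", SPEC_URI)

# Search side: every namespace needs a non-empty prefix of our choosing.
search_namespaces = {"default": DEFAULT_URI, "spec": SPEC_URI}

# Made-up sample; the file itself uses the prefix 's', which does not matter.
root = ET.fromstring(
    f'<root xmlns="{DEFAULT_URI}" xmlns:s="{SPEC_URI}">'
    "<myelem>hello</myelem><s:item>1</s:item></root>"
)

found = root.find("default:myelem", search_namespaces)
serialized = ET.tostring(root, encoding="unicode")
print(serialized)
```

In the serialized output, elements from the default namespace appear without a prefix, while the other namespace is written with the registered spec prefix.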

3 Comments

For python version 3.11 use namespaces.items() instead of namespaces.iteritems().
@Hermann12 More generally it applies to any version 3.0 or higher.
@Hermann12: Yes, I've updated my answer accordingly. Thank you for pointing it out. Yet, .find() and .iterfind() are ElementTree's methods and they behave differently.
0

A slightly longer alternative is to create another class, ElementNS, which inherits from ET.Element and includes the namespaces, then create an element factory for this class which is passed on to the parser:

import xml.etree.ElementTree as ET


def parse_namespaces(source):
    return dict(node for _e, node in ET.iterparse(source, events=['start-ns']))


def create_element_factory(namespaces):
    def element_factory(tag, attrib):
        el = ElementNS(tag, attrib)
        el.namespaces = namespaces
        return el
    return element_factory


class ElementNS(ET.Element):
    namespaces = None

    # Patch methods to include namespaces
    def find(self, path):
        return super().find(path, self.namespaces)

    def findtext(self, path, default=None):
        return super().findtext(path, default, self.namespaces)

    def findall(self, path):
        return super().findall(path, self.namespaces)

    def iterfind(self, path):
        return super().iterfind(path, self.namespaces)


def parse(source):
    # Set up parser with namespaced element factory
    namespaces = parse_namespaces(source)
    element_factory = create_element_factory(namespaces)
    tree_builder = ET.TreeBuilder(element_factory=element_factory)
    parser = ET.XMLParser(target=tree_builder)
    element_tree = ET.ElementTree()

    return element_tree.parse(source, parser=parser)

Then findall can be used without passing namespaces:

document = parse("filename")
document.findall("owl:Class")

2 Comments

Very complicated to reach the same result as described above.
@Hermann12 if you've written e.find(..., namespaces) a couple dozen times, it makes sense to make a class for it, so you only have to write e.find(...). However note that this is likely slower, as it can't rely on the C implementation of ET.Element.
