Given the following XML:

<?xml version="1.0" encoding="UTF-8"?>
<entry xmlns="http://www.w3.org/2005/Atom">
  <id>1</id>
  <title>Example XML</title>
  <published>2021-12-15T00:00:00Z</published>
  <updated>2022-01-06T12:44:47Z</updated>
  <content type="application/xml">
    <articleDoc xmlns="" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" schemaVersion="1.8" xml:lang="en">
      <articleDocHead>
        <itemInfo/>
      </articleDocHead>
    </articleDoc>
  </content>
</entry>

How can I get the value of the xml:lang attribute of the entry/content/articleDoc element? I've checked the Python docs, but unfortunately they don't cover attributes with namespaces. The solution I found, manually writing the namespace URI in front of the attribute name and using that as a dictionary key, seems wrong. I'm working with Python 3.9.9.

Here's my code so far:

import xml.etree.ElementTree as tree  # xml.etree.cElementTree was removed in Python 3.9

xml = """<?xml version="1.0" encoding="UTF-8"?>
<entry xmlns="http://www.w3.org/2005/Atom">
  <id>1</id>
  <title>Example XML</title>
  <published>2021-12-15T00:00:00Z</published>
  <updated>2022-01-06T12:44:47Z</updated>
  <content type="application/xml">
    <articleDoc xmlns="" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" schemaVersion="1.8" xml:lang="en">
      <articleDocHead>
        <itemInfo/>
      </articleDocHead>
    </articleDoc>
  </content>
</entry>"""
ns = {'nitf': 'http://iptc.org/std/NITF/2006-10-18/',
      'w3': 'http://www.w3.org/2005/Atom',
      'xml': 'http://www.w3.org/XML/1998/namespace'}
root = tree.fromstring(xml)
id = root.find("w3:id", ns).text # works
print(id)
type_attribute = root.find("w3:content", ns).attrib['type'] # works
print(type_attribute)

#language = root.find("w3:content/articleDoc/articleDocHeader[xml:lang']", ns) # doesn't work
language = root.find("w3:content/articleDoc", ns).attrib['{http://www.w3.org/XML/1998/namespace}lang'] # works, but seems wrong
print(language)

Any help is appreciated. Thanks a lot!

  • stackoverflow.com/a/61781919/407651 unfortunately does not answer my question, since I need to extract the value from the attribute after finding the element. Or does it mean there's no better way than to hardcode the .attrib['{http://www.w3.org/XML/1998/namespace}lang'] string for each attribute? Commented Jan 7, 2022 at 11:32
  • OK, see stackoverflow.com/a/62368982/407651. It may look a little clumsy, but you need to use {http://www.w3.org/XML/1998/namespace}lang (with either get() or attrib). Commented Jan 7, 2022 at 11:42
  • With the built-in ElementTree, spelling out the canonical name of the attribute is the best you can do, since attributes are implemented as dicts on elements instead of stand-alone attribute nodes, and XPath support is only rudimentary. With lxml, you can use a complete implementation of XPath, including namespace prefixes for attributes, i.e. this would work as expected: tree.xpath('//@xml:lang', namespaces=ns) and give ['en']. Commented Jan 7, 2022 at 12:01
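
If the hardcoded string is the main annoyance, a small wrapper can build the canonical `{uri}local` name from the same prefix map that is passed to find(). This is only a sketch for the standard library: `get_attr` is a hypothetical helper name (not part of ElementTree), the `atom` prefix is an arbitrary choice, and the XML is a shortened copy of the document from the question.

```python
import xml.etree.ElementTree as ET

# Prefix map; the prefix names are arbitrary labels chosen for this sketch.
ns = {
    "atom": "http://www.w3.org/2005/Atom",
    "xml": "http://www.w3.org/XML/1998/namespace",
}


def get_attr(element, prefixed_name, namespaces, default=None):
    """Hypothetical helper: read an attribute given as 'prefix:local'."""
    prefix, _, local = prefixed_name.rpartition(":")
    if not prefix:
        # Unprefixed attributes are in no namespace, so the key is the bare name.
        return element.get(local, default)
    # QName(uri, local).text yields the canonical '{uri}local' dictionary key.
    return element.get(ET.QName(namespaces[prefix], local).text, default)


doc = """<entry xmlns="http://www.w3.org/2005/Atom">
  <content type="application/xml">
    <articleDoc xmlns="" schemaVersion="1.8" xml:lang="en"/>
  </content>
</entry>"""

root = ET.fromstring(doc)
article = root.find("atom:content/articleDoc", ns)
language = get_attr(article, "xml:lang", ns)
version = get_attr(article, "schemaVersion", ns)
print(language, version)
```

The lookup is still the `{uri}local` key under the hood; the helper only keeps the URI in one place.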

1 Answer

Here is a quick guide to finding your way around an XML file using lxml.etree:

In [2]: import lxml.etree as etree

In [3]: xml = """
   ...:     <entry xmlns="http://www.w3.org/2005/Atom" xmlns:demo="http://www.whatever.com">
   ...:       <id>1</id>
   ...:       <demo:demo_child>some namespace entry</demo:demo_child>
   ...:       <title>Example XML</title>
   ...:       <published>2021-12-15T00:00:00Z</published>
   ...:       <updated>2022-01-06T12:44:47Z</updated>
   ...:       <content type="application/xml">
   ...:         <articleDoc xmlns="" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" schemaVersion="1.8" xml:lang="en">
   ...:           <articleDocHead>
   ...:             <itemInfo/>
   ...:           </articleDocHead>
   ...:         </articleDoc>
   ...:       </content>
   ...:     </entry>"""

In [4]: tree = etree.fromstring(xml)

In [5]: tree
Out[5]: <Element {http://www.w3.org/2005/Atom}entry at 0x7d010c153800>

In [6]: list(tree.iterchildren())  # get the children of the current element
Out[6]: 
[<Element {http://www.w3.org/2005/Atom}id at 0x7d010c1b06c0>,
 <Element {http://www.whatever.com}demo_child at 0x7d010c9c54c0>,
 <Element {http://www.w3.org/2005/Atom}title at 0x7d010c9c5180>,
 <Element {http://www.w3.org/2005/Atom}published at 0x7d01233d6cc0>,
 <Element {http://www.w3.org/2005/Atom}updated at 0x7d010c0d4580>,
 <Element {http://www.w3.org/2005/Atom}content at 0x7d010c0d46c0>]

In [7]: print([el.tag for el in tree.iterchildren()])    # children of the current element (human-readable tags)
['{http://www.w3.org/2005/Atom}id', '{http://www.whatever.com}demo_child', '{http://www.w3.org/2005/Atom}title', '{http://www.w3.org/2005/Atom}published', '{http://www.w3.org/2005/Atom}updated', '{http://www.w3.org/2005/Atom}content']

In [8]: print(tree[0] == next(tree.iterchildren()))  # children can also be accessed as tree[index]
True

In [9]: tree.find('id')  # FAILS (returns None): the default namespace is not taken into account

In [10]: tree.find('{http://www.w3.org/2005/Atom}id')  # now it works
Out[10]: <Element {http://www.w3.org/2005/Atom}id at 0x7d010c1b06c0>

In [11]: tree.find('{http://www.w3.org/2005/Atom}demo_child')  # FAILS (returns None): demo_child is not in the default namespace

In [12]: tree.find('{http://www.whatever.com}demo_child')  # use the element's actual namespace
Out[12]: <Element {http://www.whatever.com}demo_child at 0x7d010c9c54c0>

In [13]: tree.find(f'{{{tree.nsmap["demo"]}}}demo_child')  # look the URI up in nsmap instead of spelling it out
Out[13]: <Element {http://www.whatever.com}demo_child at 0x7d010c9c54c0>

In [14]: tree.find('{http://www.w3.org/2005/Atom}content').find('articleDoc')  # follow path of elements
Out[14]: <Element articleDoc at 0x7d010c13d9c0>

In [15]: tree.xpath('//tmp_ns:id', namespaces={'tmp_ns': tree.nsmap[None]})  # use xpath, handling default namespace is tedious here
Out[15]: [<Element {http://www.w3.org/2005/Atom}id at 0x7d010c1b06c0>]

In [16]: tree.xpath('//articleDoc')  # find elements that are not direct children
Out[16]: [<Element articleDoc at 0x7d010c13d9c0>]

In [17]: tree.xpath('//@type')  # search for attribute
Out[17]: ['application/xml']

In [18]: tree.xpath('//@xml:lang')  # search for other attribute
Out[18]: ['en']
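
For comparison, if installing lxml is not an option, a rough standard-library counterpart of the `//@xml:lang` query is to walk the tree and collect the attribute wherever it is present. This is a minimal sketch, assuming `xml.etree.ElementTree` and a shortened copy of the XML from the question; the constant name `XML_LANG` is my own.

```python
import xml.etree.ElementTree as ET

# Canonical ElementTree key for xml:lang; the 'xml' prefix is always bound to this URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

doc = """<entry xmlns="http://www.w3.org/2005/Atom">
  <id>1</id>
  <content type="application/xml">
    <articleDoc xmlns="" schemaVersion="1.8" xml:lang="en">
      <articleDocHead>
        <itemInfo/>
      </articleDocHead>
    </articleDoc>
  </content>
</entry>"""

root = ET.fromstring(doc)

# iter() visits the root and every descendant in document order.
languages = [el.get(XML_LANG) for el in root.iter() if XML_LANG in el.attrib]
print(languages)  # ['en']
```

Unlike the XPath version, this cannot be restricted by a path expression, so any filtering on the owning element (for example `el.tag == "articleDoc"`) has to be added to the comprehension by hand.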