python ElementTree.Element missing text?

Question

I'm parsing a moderately sized XML file (about 27K lines). Not far into it, I see unexpected behavior from ElementTree.Element: Element.text is populated for one entry but is None for the next, even though the text is present in the source XML, as you can see:

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
<xs:enumeration value="24">
   <xs:annotation>
      <xs:documentation>UPC12 (item-specific) on cover 2</xs:documentation>
      <xs:documentation>AKA item/price; ‘cover 2’ is defined as the inside front cover of a book</xs:documentation>
   </xs:annotation>
</xs:enumeration>
<xs:enumeration value="25">
   <xs:annotation>
      <xs:documentation>UPC12+5 (item-specific) on cover 2</xs:documentation>
      <xs:documentation>AKA item/price; ‘cover 2’ is defined as the inside front cover of a book</xs:documentation>
   </xs:annotation>
</xs:enumeration>

When I encounter an enumeration tag I call this function:

import xml.etree.cElementTree as ElementTree
...
    def _parse_list_item(xmlns: str, list_id: int, itemElement: ElementTree.Element) -> ListItem:
        if isinstance(itemElement, ElementTree.Element):
            # .get() returns None if the attribute is absent; attrib['value'] would raise KeyError
            item_id = itemElement.attrib.get('value')  # string
            if item_id is not None:
                if list_id == 6 and item_id in ('24', '25'):
                    print(list_id, item_id)  # <== debug break point here
                desc = None
                notes = ""
                for child in itemElement:
                    if child.tag == (xmlns + 'annotation'):
                        for grandchild in child:
                            if grandchild.tag == (xmlns + 'documentation'):
                                if desc is None:
                                    # the first documentation element is the description
                                    desc = grandchild.text
                                else:
                                    # any later documentation elements are joined into notes
                                    if len(notes) > 0:
                                        notes += " "  # add a space
                                    notes += grandchild.text or ""
                if item_id is not None and desc is not None:
                    return Codex.ListItem({'itemId': item_id, 'listId': list_id, 'description': desc, 'notes': notes})

If I place a breakpoint at the print statement, then at the enumeration node for "24" the text of the grandchild nodes is as shown in the XML, i.e. "UPC12..." or "AKA item...". But at the enumeration node for "25", the grandchild text is None.

When I remove the xs: namespace by pre-filtering the XML file, the grandchild text comes through fine.

Is it possible I'm over some size limit, or is there some syntax problem?
Sorry for the less-than-Pythonic code, but I wanted to be able to examine all the intermediate values in PyCharm. It's Python 3.6.

Thanks for any insights you may have!

Billal BEGUERADJ · Accepted Answer · 2018-05-06 18:01:18Z

1

In the for loop, this condition is never met if xmlns holds the bare namespace URI (without braces): if child.tag == (xmlns + 'annotation'):.

Why?

Try printing the child's tag. ElementTree stores a namespaced tag as {namespace-URI}localname, so if we suppose your namespace (xmlns) is 'Steve' then:

print(child.tag) will output: {Steve}annotation, not Steveannotation.

Given this, if child.tag == (xmlns + 'annotation'): is always False.
You should change it to: if child.tag == ('{'+xmlns+'}annotation'):

With the same logic, you will find out you will also have to change this condition:

if grandchild.tag == (xmlns + 'documentation'):

to:

if grandchild.tag == ('{'+xmlns+'}documentation'):
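To see the tag format for yourself, here is a minimal sketch. It assumes a cut-down copy of the XML from the question and uses xml.etree.ElementTree in place of cElementTree; the names XS_URI, tags, matches_braced and matches_plain are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Cut-down copy of the XML in the question (an assumption for this sketch).
XML = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
<xs:enumeration value="24">
   <xs:annotation>
      <xs:documentation>UPC12 (item-specific) on cover 2</xs:documentation>
   </xs:annotation>
</xs:enumeration>
</xs:schema>"""

XS_URI = "http://www.w3.org/2001/XMLSchema"  # bare namespace URI, no braces

root = ET.fromstring(XML)

# Collect every tag in document order, starting with the root element.
tags = [elem.tag for elem in root.iter()]
for tag in tags:
    print(tag)  # each tag is printed in {namespace-URI}localname form

# Comparing against the braced form matches; the bare URI + name does not.
matches_braced = any(t == "{" + XS_URI + "}annotation" for t in tags)
matches_plain = any(t == XS_URI + "annotation" for t in tags)
print(matches_braced, matches_plain)
```

If xmlns already contains the braces, xmlns + 'annotation' is the same string as the braced form, and the original comparison works as written.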


4 Comments

Steve L Over a year ago

Sorry, I see a bit more information is needed. The XML file contains this line near the top: <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">. So before I call into this function, I've parsed this line and set xmlns to {http://www.w3.org/2001/XMLSchema}, so when we come across child.tag = '{http://www.w3.org/2001/XMLSchema}annotation' then xmlns+'annotation' does match... I am considering removing the prefixes from the tags in a pre-processing step, if that's what's throwing me off.

mzjn Over a year ago

@SteveL: the information in the comment should be part of the question. You should provide a minimal reproducible example.

Steve L Over a year ago

@mzjn - thanks - info moved from comment to main body of question.

mzjn Over a year ago

@SteveL: OK, but I still cannot copy and paste the code and just run it. You have not provided a minimal but complete piece of code that reproduces the problem.

Steve L · Accepted Answer · 2018-05-08 00:38:37Z

0

Ultimately, I solved my problem by pre-processing the XML file to remove the xs: namespace prefix from all of the opening and closing tags; after that, the function as defined above processed the file successfully. I'm not sure why namespaces are causing problems, but perhaps there is a bug in cElementTree involving namespace prefixes in large XML files. @mzjn: I expect that a minimal example would be difficult to construct, since the code processes hundreds of items correctly before it fails, so I would at least have to provide a fairly large XML file. Nevertheless, thanks for being a sounding board.
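For comparison with the pre-processing approach, here is a minimal sketch of reading the same fields with the namespace left in place, by passing a prefix-to-URI mapping to findall. It is not the code that was actually run on the full file: the XML is a shortened copy of the snippet in the question, parse_list_item and NS are names made up for illustration, and a plain dict stands in for Codex.ListItem.

```python
import xml.etree.ElementTree as ET

# Prefix-to-URI mapping handed to findall(); the "xs" key only needs to
# match the prefix used in the path expressions below.
NS = {"xs": "http://www.w3.org/2001/XMLSchema"}

# Shortened copy of the XML from the question (an assumption for this sketch).
XML = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
<xs:enumeration value="24">
   <xs:annotation>
      <xs:documentation>UPC12 (item-specific) on cover 2</xs:documentation>
      <xs:documentation>AKA item/price</xs:documentation>
   </xs:annotation>
</xs:enumeration>
<xs:enumeration value="25">
   <xs:annotation>
      <xs:documentation>UPC12+5 (item-specific) on cover 2</xs:documentation>
      <xs:documentation>AKA item/price</xs:documentation>
   </xs:annotation>
</xs:enumeration>
</xs:schema>"""


def parse_list_item(item):
    """Return a dict for one xs:enumeration element, or None if incomplete."""
    item_id = item.get("value")
    # Treat a missing text node as an empty string so join() cannot fail.
    docs = [d.text or "" for d in item.findall("xs:annotation/xs:documentation", NS)]
    if item_id is None or not docs:
        return None
    # First documentation element is the description, the rest become notes.
    return {"itemId": item_id, "description": docs[0], "notes": " ".join(docs[1:])}


root = ET.fromstring(XML)
items = [parse_list_item(e) for e in root.findall("xs:enumeration", NS)]
for entry in items:
    print(entry)
```

This only shows that namespaced tags can be matched without editing the file. It does not explain why the text came back as None for item "25" in the original run.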
