So, I'm parsing an XML file of moderate size (about 27K lines). Not far into it, I'm seeing unexpected behavior from ElementTree.Element: I get Element.text for one entry but not for the next, even though the text is there in the source XML, as you can see:

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
<xs:enumeration value="24">
   <xs:annotation>
      <xs:documentation>UPC12 (item-specific) on cover 2</xs:documentation>
      <xs:documentation>AKA item/price; ‘cover 2’ is defined as the inside front cover of a book</xs:documentation>
   </xs:annotation>
</xs:enumeration>
<xs:enumeration value="25">
   <xs:annotation>
      <xs:documentation>UPC12+5 (item-specific) on cover 2</xs:documentation>
      <xs:documentation>AKA item/price; ‘cover 2’ is defined as the inside front cover of a book</xs:documentation>
   </xs:annotation>
</xs:enumeration>

When I encounter an enumeration tag I call this function:

import xml.etree.cElementTree as ElementTree
...
    def _parse_list_item(xmlns: str, list_id: int, itemElement: ElementTree.Element) -> ListItem:
      if isinstance(itemElement, ElementTree.Element):
        if itemElement.attrib.get('value') is not None:  # .get() returns None instead of raising KeyError when 'value' is absent
            item_id = itemElement.attrib['value']  # string
            if list_id == 6 and (item_id == '25' or item_id=='24'):
                print(list_id, item_id)  # <== debug break point here
            desc = None
            notes = ""
            for child in itemElement:
                if child.tag == (xmlns + 'annotation'):
                    for grandchild in child:
                        if grandchild.tag == (xmlns + 'documentation'):
                            if desc is None:
                                desc = grandchild.text
                            else:
                                if len(notes)>0:
                                    notes += " "  # add a space
                                notes += grandchild.text or ""
            if item_id is not None and desc is not None:
                return Codex.ListItem({'itemId': item_id, 'listId': list_id, 'description': desc, 'notes': notes})

If I place a breakpoint at the print statement, when I get to the enumeration node for "24" I can look at the text for the grandchild nodes and they are as shown in the XML, i.e. "UPC12..." or "AKA item...", but when I get to the enumeration node for "25", and look at the grandchild text, it's None.

When I remove the xs: namespace by pre-filtering the XML file, the grandchild text comes through fine.

Is it possible I'm over some size limit, or is there some syntax problem?
Sorry for the less-than-Pythonic code, but I wanted to be able to examine all the intermediate values in PyCharm. It's Python 3.6.

Thanks for any insights you may have!

2 Answers

In the for loop, this condition is never met: if child.tag == (xmlns + 'annotation'):.

Why?

Try to output the child's tag. If we suppose your namespace (xmlns) is 'Steve' then:

print(child.tag) will output: {Steve}annotation, not Steveannotation.

So given this fact, if child.tag == (xmlns + 'annotation'): is always False.
You should change it to: if child.tag == ('{'+xmlns+'}annotation'):

By the same logic, you will also have to change this condition:

if grandchild.tag == (xmlns + 'documentation'):

to:

if grandchild.tag == ('{'+xmlns+'}documentation'):
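
To check this on your side, here is a minimal sketch of how ElementTree reports a prefixed tag. It uses a cut-down copy of the XML from the question; XS_URI and the qualified() helper are names made up for this illustration, not part of any library.

```python
import xml.etree.ElementTree as ET

# Cut-down copy of the XML from the question.
XML = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:enumeration value="24">
    <xs:annotation>
      <xs:documentation>UPC12 (item-specific) on cover 2</xs:documentation>
    </xs:annotation>
  </xs:enumeration>
</xs:schema>"""

XS_URI = "http://www.w3.org/2001/XMLSchema"  # illustrative constant


def qualified(uri: str, local: str) -> str:
    """Build the '{uri}local' form that ElementTree uses for tags."""
    return "{" + uri + "}" + local


root = ET.fromstring(XML)
annotation = root[0][0]  # schema -> enumeration -> annotation

# The "xs:" prefix is gone; the namespace URI appears in braces instead.
print(annotation.tag)

# Comparison with the braces added around the URI.
matches = annotation.tag == qualified(XS_URI, "annotation")

# Comparison with the bare URI concatenated to the local name.
bare_concat = annotation.tag == XS_URI + "annotation"

print(matches, bare_concat)
```

If the xmlns variable already holds the URI wrapped in braces, then plain concatenation produces the same string as qualified() and the comparison succeeds.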

4 Comments

Sorry, I see a bit more information is needed. The XML file contains this line near the top: <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">. So before I call this function, I've parsed this line and set xmlns to {http://www.w3.org/2001/XMLSchema}, so when we come across child.tag = '{http://www.w3.org/2001/XMLSchema}annotation', xmlns + 'annotation' does match. I am considering removing the prefixes from the tags in a pre-processing step, if that's what's throwing me off.
@SteveL: the information in the comment should be part of the question. You should provide a minimal reproducible example.
@mzjn - thanks - info moved from comment to main body of question.
@SteveL: OK, but I still cannot copy and paste the code and just run it. You have not provided a minimal but complete piece of code that reproduces the problem.

So, ultimately, I solved my problem by running a pre-processing step on the XML file to remove the xs: prefix from all of the opening and closing tags, after which I was able to process the file successfully using the function as defined above. I'm not sure why namespaces are causing problems, but perhaps there is a bug in cElementTree with namespace prefixes in large XML files. @mzjn: I expect it would be difficult to construct a minimal example, since the code processes hundreds of items correctly before it fails, so I would at least have to provide a fairly large XML file. Nevertheless, thanks for being a sounding board.
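
For anyone who would rather keep the namespaces, here is a rough sketch of an alternative that needs no pre-processing. It passes a prefix-to-URI mapping to findall() and handles each enumeration on its "end" event. It rests on assumptions: the question does not show how the file is read, so the use of iterparse is a guess, and parse_list_item / parse_items are illustrative names returning plain dicts rather than the real Codex.ListItem. The ElementTree documentation notes that on a "start" event from iterparse the text and children of an element are not guaranteed to be populated yet, which could produce symptoms like the ones described, but that is unconfirmed for this case.

```python
import io
import xml.etree.ElementTree as ET

# Prefix-to-URI mapping used by findall(); the prefix is only a local alias.
NS = {"xs": "http://www.w3.org/2001/XMLSchema"}
ENUM_TAG = "{" + NS["xs"] + "}enumeration"

# Shortened sample based on the XML in the question.
SAMPLE = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
<xs:enumeration value="24">
   <xs:annotation>
      <xs:documentation>UPC12 (item-specific) on cover 2</xs:documentation>
      <xs:documentation>AKA item/price</xs:documentation>
   </xs:annotation>
</xs:enumeration>
<xs:enumeration value="25">
   <xs:annotation>
      <xs:documentation>UPC12+5 (item-specific) on cover 2</xs:documentation>
      <xs:documentation>AKA item/price</xs:documentation>
   </xs:annotation>
</xs:enumeration>
</xs:schema>"""


def parse_list_item(list_id, item):
    """Return a dict for one enumeration element, or None if incomplete."""
    item_id = item.get("value")
    docs = [d.text or ""
            for d in item.findall("xs:annotation/xs:documentation", NS)]
    if item_id is None or not docs:
        return None
    # First documentation entry is the description, the rest are notes.
    return {"itemId": item_id, "listId": list_id,
            "description": docs[0], "notes": " ".join(docs[1:])}


def parse_items(source, list_id):
    """Collect items, handling each enumeration once it is fully parsed."""
    items = []
    # "end" fires after the closing tag, so text and children are complete.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == ENUM_TAG:
            parsed = parse_list_item(list_id, elem)
            if parsed is not None:
                items.append(parsed)
    return items


items = parse_items(io.BytesIO(SAMPLE.encode("utf-8")), list_id=6)
for item in items:
    print(item)
```

If the whole file fits in memory, ET.parse() followed by root.iter(ENUM_TAG) should work just as well with the same parse_list_item helper.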
