
I am trying to parse information from XML files using Python's xml module. The problem is that when I specify a list of files and start parsing, the first file is (supposedly) parsed successfully, but then I get the following error:

Parsing 20586908.xml ..
Parsing 20586934.xml ..


Traceback (most recent call last):
  File "<ipython-input-72-0efdae22e237>", line 11, in parse
    xmlTree = ET.parse(xmlFilePath, parser = self.parser)
  File "C:\Users\StefanCepa995\miniconda3\envs\dl4cv\lib\xml\etree\ElementTree.py", line 1202, in parse
    tree.parse(source, parser)
  File "C:\Users\StefanCepa995\miniconda3\envs\dl4cv\lib\xml\etree\ElementTree.py", line 601, in parse
    parser.feed(data)
xml.etree.ElementTree.ParseError: parsing finished: line 1755, column 0

Here is the code I am using to parse XML files:

class INBreastXMLParser:
    def __init__(self, xmlRootDir):
        self.parser         = ET.XMLParser(encoding="utf-8")
        self.xmlAnnotations = [os.path.join(root, f)
                                   for root, dirs, files in os.walk(xmlRootDir)
                                              for f in files if f.endswith('.xml')]
    def parse(self):
        for xmlFilePath in self.xmlAnnotations:
            logger.info(f"Parsing {os.path.basename(xmlFilePath)} ..")
            try:
                xmlTree = ET.parse(xmlFilePath, parser = self.parser)
                root    = xmlTree.getroot()
            except Exception as err:
                logging.error(f"Could not parse {xmlFilePath}. Reason - {err}")
                traceback.print_exc()
                

And here is a screenshot of the part of the file where parsing fails:

[Screenshot: the part of the XML file where parsing fails]


2 Answers


The problem is that the ET.XMLParser instance is reused across files. The underlying XML library (Expat) used by ElementTree does not support this, as the documentation for xml.parsers.expat notes:

Due to limitations in the Expat library used by pyexpat, the xmlparser instance returned can only be used to parse a single XML document. Call ParserCreate for each document to provide unique parser instances.
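The limitation can be reproduced without any files on disk. This is a minimal sketch, assuming two tiny in-memory documents (`<a/>` and `<b/>`, made up for illustration) in place of the real annotation files; the exact error text may vary between Python versions, so it is only printed here.

```python
import io
import xml.etree.ElementTree as ET

# One parser instance, shared across documents (the pattern from the question).
parser = ET.XMLParser(encoding="utf-8")

# First document: parses fine, and finishes the parser.
first_root = ET.parse(io.BytesIO(b"<a/>"), parser=parser).getroot()

# Second document with the same parser instance: expected to fail,
# because the underlying Expat parser has already been finished.
reuse_failed = False
try:
    ET.parse(io.BytesIO(b"<b/>"), parser=parser)
except ET.ParseError as err:
    reuse_failed = True
    print(f"Reusing the parser failed: {err}")
```

The second document is well-formed, so the failure comes from the parser state and not from the input.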

You need to create a new parser for each XML file. Move

self.parser = ET.XMLParser(encoding="utf-8") 

from the __init__ method to the parse method.
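A sketch of the class with that change applied might look like the following. The imports, the module-level `logger`, and the list of roots returned from `parse` are assumptions added so the example is self-contained; they are not part of the original code.

```python
import logging
import os
import xml.etree.ElementTree as ET

logger = logging.getLogger(__name__)  # assumed logger setup


class INBreastXMLParser:
    def __init__(self, xmlRootDir):
        # Collect every .xml file below the root directory.
        self.xmlAnnotations = [os.path.join(root, f)
                               for root, dirs, files in os.walk(xmlRootDir)
                               for f in files if f.endswith('.xml')]

    def parse(self):
        roots = []  # assumption: return the parsed roots to the caller
        for xmlFilePath in self.xmlAnnotations:
            logger.info(f"Parsing {os.path.basename(xmlFilePath)} ..")
            try:
                # A fresh parser for every document; Expat parsers are single-use.
                parser = ET.XMLParser(encoding="utf-8")
                xmlTree = ET.parse(xmlFilePath, parser=parser)
                roots.append(xmlTree.getroot())
            except ET.ParseError as err:
                logger.error(f"Could not parse {xmlFilePath}. Reason - {err}")
        return roots
```

If the explicit encoding override is not needed, calling `ET.parse(xmlFilePath)` without a `parser` argument also works, since ElementTree then creates a new parser for each call.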


1 Comment

This is exactly what the problem was. Thanks!

Parse errors can and do happen. They have exactly one reason: the parser raised an error. Even though that is a single reason, the causes can be many. Three common ones:

  • The input is invalid (e.g. invalid XML in your example)
  • The parser is incompatible (e.g. the XML input is valid, but encoded in a form or variant the parser cannot handle)
  • The parser itself has errors (e.g. software bugs)

Since the parser you are using is itself software, and there is normally a bug in every ~173 lines of code, this could be worth a quick look.

But only if you can look quickly. It is often not worth the effort, because the problem more commonly lies in the input, so it is probably worth checking that first.

In any case, you're in luck: you want to process XML, and tooling for that exists. Validate the file on disk; the parse error from your program already hints that it might be invalid.

Also, move that file out of the directory and run your script again. It might not be the only invalid file, and you will want to find out as quickly as possible how many of the remaining files cause problems with your script.

1 Comment

So the third bullet was it then.
