ElementTree's iterparse() XML parsing error

Question

I need to parse a 1.2GB XML file encoded as "ISO-8859-1". After reading a few articles online, it seems that iterparse() from Python's ElementTree is preferred over SAX parsing.

I've written an extremely short piece of code just to test it out, but it raises an error that I have no idea how to solve.

My Code (Python 2.7):

from xml.etree.ElementTree import iterparse

for (event, node) in iterparse('dblp.xml', events=['start']):
    print node.tag
    node.clear()

Edit: The file is so big and slow to open that I typed out the XML line by hand and made a mistake. The entity is "&uuml;". I apologize for this.

This code works fine until it hits a line in the XML file that looks like this:

<Journal>Technical Report 248, ETH Z&uuml;rich, Dept of Computer Science</Journal>

I guess this means Zürich, but the parser does not seem to know that.

Running the code above gave me an error:

xml.etree.ElementTree.ParseError: undefined entity &uuml;

Is there any way to solve this issue? I've googled quite a few solutions, but none seem to deal with this problem directly.

OK, you've got an inconsistency that needs resolving. In the XML you have &umml and in the error you have &uuml. If both are &umml, the XML is invalid and needs correcting. If both are &uuml, that is a defined entity, so it should work. If they really are different, you'll need to give some more info on the file.
– user764357, Commented Sep 17, 2013 at 4:07

Community · Accepted Answer · 2017-05-23 12:15:16Z

2

Try the following. The first version looks up any HTML entity name and also corrects the misspelled umml:

from xml.etree.ElementTree import iterparse, XMLParser
import htmlentitydefs

class CustomEntity:
    # Dict-like lookup: maps an HTML entity name to its Unicode character.
    def __getitem__(self, key):
        if key == 'umml':
            key = 'uuml' # Fix invalid entity
        return unichr(htmlentitydefs.name2codepoint[key])

parser = XMLParser()
# Treat the document as having an external DTD, so that unknown
# entities are looked up in parser.entity.
parser.parser.UseForeignDTD(True)
parser.entity = CustomEntity()

for (event, node) in iterparse('dblp.xml', events=['start'], parser=parser):
    print node.tag
    node.clear()

OR

from xml.etree.ElementTree import iterparse, XMLParser
import htmlentitydefs

parser = XMLParser()
parser.parser.UseForeignDTD(True)
# Map only the misspelled entity name 'umml' to the character for 'uuml'.
parser.entity = {'umml': unichr(htmlentitydefs.name2codepoint['uuml'])}

for (event, node) in iterparse('dblp.xml', events=['start'], parser=parser):
    print node.tag
    node.clear()
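
Both snippets above are written for Python 2.7. For readers on Python 3, here is a minimal sketch of the same idea, not a drop-in replacement: htmlentitydefs becomes html.entities, unichr becomes chr, and because recent versions of iterparse() no longer accept a parser argument, the sketch feeds the file to XMLParser in chunks and collects tags through a custom target. It assumes the document has a DOCTYPE that points at an external DTD. The sample bytes, the TagCollector class, and the parse_stream helper are made up for illustration.

```python
import io
from html.entities import name2codepoint
from xml.etree.ElementTree import XMLParser

# Hypothetical stand-in for dblp.xml. Assumption: the real file also
# declares an external DTD with a DOCTYPE line like the one below.
SAMPLE = (
    b'<?xml version="1.0" encoding="ISO-8859-1"?>\n'
    b'<!DOCTYPE dblp SYSTEM "dblp.dtd">\n'
    b'<dblp><article><Journal>Technical Report 248, ETH Z&uuml;rich, '
    b'Dept of Computer Science</Journal></article></dblp>'
)


class TagCollector:
    """Parser target that records start tags and character data."""

    def __init__(self):
        self.tags = []
        self.text = []

    def start(self, tag, attrib):
        self.tags.append(tag)

    def end(self, tag):
        pass

    def data(self, data):
        # Resolved entities arrive here as ordinary text.
        self.text.append(data)

    def close(self):
        return self.tags, ''.join(self.text)


def parse_stream(stream, chunk_size=64 * 1024):
    """Feed a binary stream to XMLParser in chunks and return the target's result."""
    parser = XMLParser(target=TagCollector())
    # Register every HTML entity name (uuml, nbsp, ...) with the parser.
    parser.entity.update((name, chr(cp)) for name, cp in name2codepoint.items())
    for chunk in iter(lambda: stream.read(chunk_size), b''):
        parser.feed(chunk)
    return parser.close()


tags, text = parse_stream(io.BytesIO(SAMPLE))
print(tags)
print(text)
```

For the real file, open('dblp.xml', 'rb') could be passed to parse_stream in place of the BytesIO object. A target meant for a 1.2GB file would process each element as it arrives rather than accumulate everything in lists.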

Related question: Python ElementTree support for parsing unknown XML entities?

edited May 23, 2017 at 12:15

CommunityBot

answered Sep 17, 2013 at 6:15

falsetru

2 Comments

Clément Over a year ago

Not sure why this is the accepted answer; the OP states that it's uuml, not umml...

Clément Over a year ago

I've seen it; I just wonder why this is the accepted answer, given that it doesn't really answer the question in its current form.
