I need to parse a 1.2GB XML file encoded as "ISO-8859-1". After reading a few articles online, it seems that iterparse() from Python's ElementTree is preferred over SAX parsing.

I've written an extremely short piece of code just to test it out, but it raises an error that I have no idea how to solve.

My Code (Python 2.7):

from xml.etree.ElementTree import iterparse

for (event, node) in iterparse('dblp.xml', events=['start']):
    print node.tag
    node.clear()

Edit: Because the file is so large and slow to open, I typed out the XML line by hand and made a mistake. The entity is "&uuml;" (I had originally typed "&umml;"). I apologize for the confusion.

This code works fine until it hits a line in the XML file that looks like this:

<Journal>Technical Report 248, ETH Z&uuml;rich, Dept of Computer Science</Journal>

which I guess means Zurich, but the parser does not seem to know this.

Running the code above gave me an error:

xml.etree.ElementTree.ParseError: undefined entity &uuml;

Is there any way I can solve this issue? I've googled quite a few solutions, but none seem to deal with this problem directly.
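
For reference, the failure can be reproduced without the 1.2GB file. The following is a minimal sketch in Python 3 (the question itself uses Python 2.7); try_parse is a hypothetical helper, not part of ElementTree. XML predefines only five named entities ("&lt;", "&gt;", "&amp;", "&apos;" and "&quot;"), so a name such as "&uuml;" has to be declared somewhere, whereas a numeric character reference such as "&#252;" needs no declaration.

```python
from xml.etree.ElementTree import ParseError, fromstring


def try_parse(xml_text):
    """Hypothetical helper: return the root element's text, or the parse error."""
    try:
        return fromstring(xml_text).text
    except ParseError as exc:
        return 'ParseError: %s' % exc


# Named entity that is not declared anywhere in the document: parsing fails.
result_named = try_parse('<Journal>Technical Report 248, ETH Z&uuml;rich</Journal>')

# Numeric character reference for the same character: parsing succeeds.
result_numeric = try_parse('<Journal>Technical Report 248, ETH Z&#252;rich</Journal>')

print(result_named)
print(result_numeric)
```

The first call reports an undefined entity, while the second returns the text with the "ü" decoded.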

  • Ok, you've got an inconsistency that needs resolving. In the XML you have &umml and in the error you have &uuml. If both are &umml, the XML is invalid and needs correcting. If both are &uuml, that is a defined entity, so it should work. If they are actually different, you'll need to give some more info on the file. Commented Sep 17, 2013 at 4:07

1 Answer

Try the following:

from xml.etree.ElementTree import iterparse, XMLParser
import htmlentitydefs

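class CustomEntity:
    # Stands in for the parser's entity table: resolves HTML entity names on demand.
    def __getitem__(self, key):
        if key == 'umml':
            key = 'uuml' # Fix invalid entity
        # A name unknown to htmlentitydefs raises KeyError, which the
        # parser reports as an undefined entity.
        return unichr(htmlentitydefs.name2codepoint[key])

parser = XMLParser()
# Make expat behave as if an external DTD were present, so that unknown
# entities are looked up in parser.entity instead of being rejected outright.
parser.parser.UseForeignDTD(True)
parser.entity = CustomEntity()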

for (event, node) in iterparse('dblp.xml', events=['start'], parser=parser):
    print node.tag
    node.clear()

OR

from xml.etree.ElementTree import iterparse, XMLParser
import htmlentitydefs

parser = XMLParser()
parser.parser.UseForeignDTD(True)
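# Maps only the misspelled name 'umml'; add one entry per entity name that actually occurs in the file.
parser.entity = {'umml': unichr(htmlentitydefs.name2codepoint['uuml'])}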

for (event, node) in iterparse('dblp.xml', events=['start'], parser=parser):
    print node.tag
    node.clear()

Related question: Python ElementTree support for parsing unknown XML entities?
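
For readers on Python 3, where htmlentitydefs became html.entities and unichr became chr, the same idea might be sketched as below. This is an illustration under several assumptions, not a definitive implementation: JournalCollector and parse_with_html_entities are hypothetical names, and the sample document assumes the real file declares an external DTD in its DOCTYPE. Without such a declaration, expat rejects the entity before parser.entity is consulted. It also assumes that parser.entity is a mutable dict, which is how CPython's XMLParser behaves as far as I know, but is worth verifying on your version.

```python
import io
from html.entities import name2codepoint
from xml.etree.ElementTree import XMLParser


class JournalCollector:
    """Hypothetical parser target: collects <Journal> text without building a tree."""

    def __init__(self):
        self.journals = []
        self._buffer = None  # list of text pieces while inside <Journal>

    def start(self, tag, attrib):
        if tag == 'Journal':
            self._buffer = []

    def data(self, text):
        # Text may arrive in several pieces, e.g. around a resolved entity.
        if self._buffer is not None:
            self._buffer.append(text)

    def end(self, tag):
        if tag == 'Journal' and self._buffer is not None:
            self.journals.append(''.join(self._buffer))
            self._buffer = None

    def close(self):
        return self.journals


def parse_with_html_entities(stream, chunk_size=64 * 1024):
    """Feed a binary stream to the parser in chunks, resolving HTML entity names."""
    parser = XMLParser(target=JournalCollector())
    # Assumption: parser.entity is a dict consulted for otherwise undefined entities.
    parser.entity.update((name, chr(cp)) for name, cp in name2codepoint.items())
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        parser.feed(chunk)
    return parser.close()  # returns whatever the target's close() returns


# Assumed shape of the real file: an ISO-8859-1 declaration plus an external DTD.
SAMPLE = (
    b'<?xml version="1.0" encoding="ISO-8859-1"?>\n'
    b'<!DOCTYPE dblp SYSTEM "dblp.dtd">\n'
    b'<dblp><Journal>Technical Report 248, ETH Z&uuml;rich, '
    b'Dept of Computer Science</Journal></dblp>'
)

journals = parse_with_html_entities(io.BytesIO(SAMPLE))
print(journals)
```

For the real file, the stream would come from open('dblp.xml', 'rb'). Feeding fixed-size chunks to a target that keeps only what it needs avoids holding the whole tree in memory.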

2 Comments

Not sure why this is the accepted answer; the OP states that it's uuml, not umml...
I've seen it; I just wonder why this is the accepted answer, given that it doesn't really answer the question in its current form.
