python - Parse XML with unicode characters into ElementTree

Question

I'm using PDFminer, but it contains a bug and I get the following invalid XML file:

<?xml version="1.1" encoding="UTF-8"?>
<string size="16">&#244;&#130;&#204;&#2;f&#198;&#135;&#143;&#11;*&#154;&#23;]&#214;&#20;[</string>

When I try to parse it with ElementTree, I get the following error:

    bookXml = xml.etree.ElementTree.parse(filename)
  File "C:\Users\User\Anaconda3\lib\xml\etree\ElementTree.py", line 1196, in parse
    tree.parse(source, parser)
  File "C:\Users\User\Anaconda3\lib\xml\etree\ElementTree.py", line 597, in parse
    self._root = parser._parse_whole(source)
xml.etree.ElementTree.ParseError: reference to invalid character number: line 1, column 36

I think the best way to handle this case is to fix the XML first, but how?

The problem seems to be &#2; (and a few others), which equals U+0002 and which, AFAIK, is not a valid character in an XML file.
– rodrigo, Commented Oct 13, 2017 at 10:20
Oh, the XML version is "1.1"! You don't see that every day. Then I guess U+0002 is correct after all, but you'll have a hard time finding compatible tools...
– rodrigo, Commented Oct 13, 2017 at 10:22
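
To make the point in these comments concrete, here is a minimal sketch that lists the numeric character references in the question's string whose code points fall outside the XML 1.0 `Char` production. The helper names `is_xml10_char` and `invalid_xml10_refs` are made up for illustration; the ranges are the ones given in the XML 1.0 specification.

```python
import re

# Matches decimal (&#2;) and hexadecimal (&#x2;) character references.
CHAR_REF = re.compile(r"&#(?:x([0-9A-Fa-f]+)|([0-9]+));")


def is_xml10_char(cp):
    """Return True if the code point is allowed by the XML 1.0 Char production."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def invalid_xml10_refs(text):
    """Collect the code points of character references that XML 1.0 forbids."""
    bad = []
    for match in CHAR_REF.finditer(text):
        hex_digits, dec_digits = match.groups()
        cp = int(hex_digits, 16) if hex_digits else int(dec_digits)
        if not is_xml10_char(cp):
            bad.append(cp)
    return bad


# The element from the question, copied verbatim.
sample = (
    '<string size="16">&#244;&#130;&#204;&#2;f&#198;&#135;&#143;&#11;'
    '*&#154;&#23;]&#214;&#20;[</string>'
)
print(invalid_xml10_refs(sample))  # [2, 11, 23, 20]
```

XML 1.1 permits these four code points when they are written as character references, but judging by the traceback in the question, the parser behind ElementTree applies the XML 1.0 rules regardless of the declared version.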

james-see · Accepted Answer · 2017-10-12 17:39:02Z

I would wrap the offending string content in a CDATA section. The file parsed as soon as I did this. Example:

    <?xml version="1.1" encoding="UTF-8"?>
    <string><![CDATA[&#244;&#130;&#204;&#2;&#198;&#135;&#143;&#11;*&#154;&#23;&#214;&#20;]]></string>

More about CDATA here.
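
If the file has to be fixed programmatically, the wrapping could be sketched roughly as follows. This is only an illustration under some assumptions: the helper `wrap_in_cdata` is a made-up name, the regex assumes that `<string>` elements contain plain text with no child elements, and everything inside them (including references such as `&lt;`) ends up as literal text.

```python
import re
import xml.etree.ElementTree as ET

# Opening tag (not self-closing), body, closing tag of a <string> element.
STRING_ELEMENT = re.compile(
    r"(<string\b[^>]*(?<!/)>)(.*?)(</string>)", re.DOTALL
)


def wrap_in_cdata(xml_text):
    """Wrap the body of every <string> element in a CDATA section."""
    def repl(match):
        open_tag, body, close_tag = match.groups()
        # "]]>" would end the CDATA section early, so split it across two sections.
        body = body.replace("]]>", "]]]]><![CDATA[>")
        return f"{open_tag}<![CDATA[{body}]]>{close_tag}"

    return STRING_ELEMENT.sub(repl, xml_text)


# The document from the question.
broken = (
    '<?xml version="1.1" encoding="UTF-8"?>\n'
    '<string size="16">&#244;&#130;&#204;&#2;f&#198;&#135;&#143;&#11;'
    '*&#154;&#23;]&#214;&#20;[</string>'
)

root = ET.fromstring(wrap_in_cdata(broken))
# The references are no longer interpreted; root.text holds them as literal text.
print(root.text)
```

Note that the parser now sees the character references as ordinary characters, so they still have to be decoded in a separate step.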


1 Comment

happy_marmoset Over a year ago

This is only a temporary solution, because now I need to call html.unescape() to get the required value.
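
As a possible alternative for that decoding step, here is a minimal sketch that turns the literal numeric references into the characters with exactly those code points. `decode_char_refs` is a hypothetical helper, not part of any library. It may be worth comparing against html.unescape(), which follows HTML rules and, as far as I know, remaps or drops some control code points instead of returning them unchanged.

```python
import re
import xml.etree.ElementTree as ET

# Matches decimal (&#2;) and hexadecimal (&#x2;) character references.
CHAR_REF = re.compile(r"&#(?:x([0-9A-Fa-f]+)|([0-9]+));")


def decode_char_refs(text):
    """Replace each numeric character reference with chr(code point)."""
    def repl(match):
        hex_digits, dec_digits = match.groups()
        return chr(int(hex_digits, 16) if hex_digits else int(dec_digits))

    return CHAR_REF.sub(repl, text)


# A shortened CDATA-wrapped document in the style of the accepted answer.
doc = (
    '<?xml version="1.1" encoding="UTF-8"?>\n'
    '<string><![CDATA[&#244;&#130;&#204;&#2;f&#198;]]></string>'
)

root = ET.fromstring(doc)
decoded = decode_char_refs(root.text)
print([ord(c) for c in decoded])  # [244, 130, 204, 2, 102, 198]
```

Named entities such as `&amp;` are deliberately left alone here; only numeric references are touched.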
