I'm using PDFMiner, but it seems to contain a bug, and I get the following invalid XML file:

<?xml version="1.1" encoding="UTF-8"?>
<string size="16">&#244;&#130;&#204;&#2;f&#198;&#135;&#143;&#11;*&#154;&#23;]&#214;&#20;[</string>

When I try to parse it with ElementTree, I get the following error:

    bookXml = xml.etree.ElementTree.parse(filename)
  File "C:\Users\User\Anaconda3\lib\xml\etree\ElementTree.py", line 1196, in parse
    tree.parse(source, parser)
  File "C:\Users\User\Anaconda3\lib\xml\etree\ElementTree.py", line 597, in parse
    self._root = parser._parse_whole(source)
xml.etree.ElementTree.ParseError: reference to invalid character number: line 1, column 36

I think the best way to handle this case is to fix the XML first, but how?

  • The problem seems to be that &#2; (and a few of the other references) resolves to U+0002, which AFAIK is not a valid character in an XML file. Commented Oct 13, 2017 at 10:20
  • Oh, the XML version is "1.1"! You don't see that every day. Then I guess U+0002 is correct after all, but you'll have a hard time finding compatible tools... Commented Oct 13, 2017 at 10:22
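
To see which references a parser that applies XML 1.0 rules will reject, something like the following sketch may help. It checks each numeric reference against the XML 1.0 `Char` production; the helper names `is_xml10_char` and `invalid_charrefs` are made up for this illustration and are not part of PDFMiner or ElementTree.

```python
import re

def is_xml10_char(code):
    """True if the code point matches the XML 1.0 Char production."""
    return (
        code in (0x9, 0xA, 0xD)
        or 0x20 <= code <= 0xD7FF
        or 0xE000 <= code <= 0xFFFD
        or 0x10000 <= code <= 0x10FFFF
    )

def invalid_charrefs(xml_text):
    """Return the code points of numeric references that XML 1.0 forbids."""
    codes = []
    # Each match is a (hex_digits, dec_digits) pair; one of the two is empty.
    for hex_digits, dec_digits in re.findall(r"&#(?:x([0-9a-fA-F]+)|([0-9]+));", xml_text):
        code = int(hex_digits, 16) if hex_digits else int(dec_digits)
        if not is_xml10_char(code):
            codes.append(code)
    return codes

# Element content copied from the question.
sample = "&#244;&#130;&#204;&#2;f&#198;&#135;&#143;&#11;*&#154;&#23;]&#214;&#20;["
print(invalid_charrefs(sample))
```

For the string in the question this flags 2, 11, 23 and 20, which are all control characters below U+0020.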

1 Answer

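I would wrap the offending string in a CDATA section. Character references are not expanded inside CDATA, so the parser reads them as literal text instead of rejecting them. I confirmed that the file parses as soon as I did this. Example: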

    <?xml version="1.1" encoding="UTF-8"?>
    <string><![CDATA[&#244;&#130;&#204;&#2;&#198;&#135;&#143;&#11;*&#154;&#23;&#214;&#20;]]></string>

More about CDATA here.
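
A quick way to check the difference, as a rough sketch: the sample below is a shortened version of the string from the question, and the variable names are only for illustration. It assumes the parser behind ElementTree accepts the `version="1.1"` declaration, which the traceback in the question suggests it does.

```python
import xml.etree.ElementTree as ET

# Shortened sample; the declaration is kept as in the question.
plain = b'<?xml version="1.1" encoding="UTF-8"?>\n<string>&#244;&#2;f</string>'
wrapped = b'<?xml version="1.1" encoding="UTF-8"?>\n<string><![CDATA[&#244;&#2;f]]></string>'

# Without CDATA the reference to U+0002 is rejected.
try:
    ET.fromstring(plain)
    plain_parses = True
except ET.ParseError:
    plain_parses = False

# With CDATA the references survive as literal text, undecoded.
text = ET.fromstring(wrapped).text

print(plain_parses)
print(text)
```

Note that the element text is now the reference syntax itself, not the characters it stands for, so a decoding step is still needed afterwards.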

1 Comment

This is only a temporary solution, because now I need to call html.unescape() to get the required value.
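
One caveat with that step: as far as I know, html.unescape() follows HTML5 rules, which remap references in the 128 to 159 range (for example &#130;) and drop some control characters such as &#2;, so the result may not match the original data. A minimal sketch of decoding the references literally instead, assuming each reference stands for one byte value; `decode_charrefs` is a hypothetical helper, not a library function:

```python
import re

# Numeric character references in XML syntax: &#NNN; or &#xHHH;
_CHARREF = re.compile(r"&#(?:x([0-9a-fA-F]+)|([0-9]+));")

def decode_charrefs(text):
    """Replace each numeric reference with exactly the code point it names."""
    def repl(match):
        hex_digits, dec_digits = match.groups()
        code = int(hex_digits, 16) if hex_digits else int(dec_digits)
        return chr(code)
    return _CHARREF.sub(repl, text)

# Literal text as it comes out of the CDATA section (shortened sample).
literal = "&#244;&#130;&#204;&#2;f"
decoded = decode_charrefs(literal)

# Assumption: every code point is below 256, so latin-1 maps it to one byte.
as_bytes = decoded.encode("latin-1")
print(as_bytes)
```

If a reference above 255 ever shows up, the latin-1 step will raise UnicodeEncodeError, which at least makes the broken assumption visible.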
