
Any clue on how to parse XML in Python that has encoding='Windows-1255' in its declaration? At least the lxml.etree parser won't even look at the string when the XML header has an "encoding" attribute which isn't "utf-8" or "ASCII".

Running the following code fails with:

ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.

from lxml import etree

parser = etree.XMLParser(encoding='utf-8')

def convert_xml_to_utf8(xml_str):
    tree = etree.fromstring(xml_str, parser=parser)
    if tree.docinfo.encoding == 'utf-8':
        # already in correct encoding, abort
        return xml_str
    decoded_str = xml_str.decode(tree.docinfo.encoding)
    utf8_encoded_str = decoded_str.encode('utf-8')
    tree = etree.fromstring(utf8_encoded_str)
    tree.docinfo.encoding = 'utf-8'
    return etree.tostring(tree, pretty_print = True, xml_declaration = True, encoding='UTF-8', standalone="yes")


data = '''<?xml version='1.0' encoding='Windows-1255'?><rss version="2.0"><channel ><title ><![CDATA[ynet - חדשות]]></title></channel></rss>'''
print(convert_xml_to_utf8(data))

1 Answer


data is a Unicode str. The error is saying that a str which also contains an encoding="..." declaration is not supported: a str has supposedly already been decoded from its encoding, so it's ambiguous/nonsensical for it to also carry an encoding declaration. It's telling you to use bytes instead, e.g. data = b'<...>'. Presumably you should be opening a file in binary mode, reading the data from there and letting etree handle the encoding="...", instead of using string literals in your code, which complicates the encoding situation even further.

It's as simple as:

from xml.etree import ElementTree

#        open in binary mode ↓
with open('/tmp/test.xml', 'rb') as f:
    e = ElementTree.fromstring(f.read())

Et voilà, e contains your parsed file with the encoding having been (presumably) correctly interpreted by etree based on the file's internal encoding="..." header.

ElementTree in fact has a shortcut for this, which opens the file itself and returns an ElementTree object (call e.getroot() to get the root element):

e = ElementTree.parse('/tmp/test.xml')
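
If there is no file at hand, the same behaviour can be checked entirely in memory. The following is only a minimal sketch: the sample feed is taken from the question, the variable names are made up for illustration, and the .encode("cp1255") call merely stands in for whatever produced the raw bytes (a file opened in 'rb' mode, a socket, etc.).

```python
from xml.etree import ElementTree

# Sample document from the question, written here as text for convenience.
xml_text = (
    "<?xml version='1.0' encoding='Windows-1255'?>"
    '<rss version="2.0"><channel>'
    "<title><![CDATA[ynet - חדשות]]></title>"
    "</channel></rss>"
)

# Assumption: simulate the raw bytes a binary-mode read would return,
# i.e. bytes in the encoding named by the XML declaration.
xml_bytes = xml_text.encode("cp1255")

# Given bytes, the parser reads the declaration and does the decoding itself.
root = ElementTree.fromstring(xml_bytes)
title = root.findtext("channel/title")
print(title)
```

The point is that nothing in your own code ever calls .decode(); the parser is the only place where the declared encoding is applied.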

2 Comments

So if I have a byte array which I fetched from the network (rss/html) with non-ascii bytes in it the function will handle them?
Yes. If you feed bytes (not str) to the parser, it will parse the XML encoding declaration and decode the bytes to str based on that. In the above example f.read() produces the bytes, but that value can come from anywhere.
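
For the original goal of getting UTF-8 output from bytes that arrived in another encoding, something along these lines might do. It is a sketch rather than a definitive implementation: the function name is borrowed from the question, payload is a hypothetical stand-in for the network response, and the xml_declaration argument of ElementTree.tostring assumes Python 3.8 or later.

```python
from xml.etree import ElementTree


def convert_xml_to_utf8(xml_bytes: bytes) -> bytes:
    """Parse XML bytes in their declared encoding and re-serialize as UTF-8."""
    # The parser decodes according to the declaration inside xml_bytes.
    root = ElementTree.fromstring(xml_bytes)
    # Serialize again, this time as UTF-8 with a matching declaration.
    return ElementTree.tostring(root, encoding="utf-8", xml_declaration=True)


# Hypothetical stand-in for bytes fetched from the network.
payload = (
    "<?xml version='1.0' encoding='Windows-1255'?>"
    '<rss version="2.0"><channel><title>ynet - חדשות</title></channel></rss>'
).encode("cp1255")

utf8_xml = convert_xml_to_utf8(payload)
print(utf8_xml.decode("utf-8"))
```

Note that ElementTree does not preserve CDATA sections; their content comes back as ordinary (escaped) text, which is equivalent for most consumers.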
