Parse XML in Python with encoding other than utf-8

Question

Any clue on how to parse XML in Python that declares encoding='Windows-1255' in its header? At least the lxml.etree parser won't even look at the string when the XML declaration carries an encoding attribute other than "utf-8" or "ASCII".

Running the following code fails with:

ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.

from lxml import etree

parser = etree.XMLParser(encoding='utf-8')

def convert_xml_to_utf8(xml_str):
    tree = etree.fromstring(xml_str, parser=parser)
    if tree.docinfo.encoding == 'utf-8':
        # already in correct encoding, abort
        return xml_str
    decoded_str = xml_str.decode(tree.docinfo.encoding)
    utf8_encoded_str = decoded_str.encode('utf-8')
    tree = etree.fromstring(utf8_encoded_str)
    tree.docinfo.encoding = 'utf-8'
    return etree.tostring(tree, pretty_print = True, xml_declaration = True, encoding='UTF-8', standalone="yes")


data = '''<?xml version='1.0' encoding='Windows-1255'?><rss version="2.0"><channel ><title ><![CDATA[ynet - חדשות]]></title></channel></rss>'''
print(convert_xml_to_utf8(data))


deceze · Accepted Answer · 2017-12-19 11:09:26Z


data is a Unicode str. The error says that a str containing an encoding="..." declaration is not supported: a str has already been decoded from whatever byte encoding it came in, so a declaration describing that byte encoding is ambiguous at best and nonsensical at worst. It's telling you to use bytes instead, e.g. data = b'<...>'. Rather than embedding string literals in your code, which complicates the encoding situation even further, you should presumably open the file in binary mode, read the data from there, and let etree handle the encoding="..." declaration.

It's as simple as:

from xml.etree import ElementTree

#        open in binary mode ↓
with open('/tmp/test.xml', 'rb') as f:
    e = ElementTree.fromstring(f.read())

Et voilà, e contains your parsed file with the encoding having been (presumably) correctly interpreted by etree based on the file's internal encoding="..." header.
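If the XML only exists as a str, as with the literal in the question, one option is to encode it to bytes with the encoding the declaration names before parsing. The following is a minimal sketch under that assumption. The sample document is made up for illustration, and it uses the standard library parser; lxml.etree.fromstring likewise accepts bytes, as its error message suggests.

```python
from xml.etree import ElementTree

# Hypothetical sample data modelled on the feed in the question.
title = "ynet - חדשות"
xml_str = (
    "<?xml version='1.0' encoding='Windows-1255'?>"
    '<rss version="2.0"><channel><title>' + title + "</title></channel></rss>"
)

# Encode the text with the encoding the declaration names, so that the
# bytes and the declaration agree ("cp1255" is Python's codec for Windows-1255).
xml_bytes = xml_str.encode("cp1255")

# Given bytes, the parser reads the declaration and does the decoding itself.
root = ElementTree.fromstring(xml_bytes)
parsed_title = root.find("channel/title").text
print(parsed_title == title)
```

The important part is that the bytes really are in the declared encoding; encoding the same str as UTF-8 while leaving the Windows-1255 declaration in place would give the parser contradictory information.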

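For files, ElementTree also has a shortcut that opens and parses the file in one step (note that it returns an ElementTree object, so call .getroot() on the result to get the root element):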

e = ElementTree.parse('/tmp/test.xml')
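To get from there to the UTF-8 output the question was after, the parsed tree can simply be serialised again with a different encoding. This is a sketch, not a definitive implementation: the temporary file stands in for /tmp/test.xml, the sample content is invented, and the xml_declaration argument of ElementTree.tostring assumes Python 3.8 or later.

```python
import os
import tempfile
from xml.etree import ElementTree

# Hypothetical sample content, stored on disk as Windows-1255 bytes.
title = "ynet - חדשות"
source = (
    "<?xml version='1.0' encoding='Windows-1255'?>"
    '<rss version="2.0"><channel><title>' + title + "</title></channel></rss>"
).encode("cp1255")

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "test.xml")
    with open(path, "wb") as f:
        f.write(source)

    tree = ElementTree.parse(path)  # returns an ElementTree object
    root = tree.getroot()           # the <rss> Element

# Serialise the tree as UTF-8 bytes with a matching XML declaration.
utf8_bytes = ElementTree.tostring(root, encoding="utf-8", xml_declaration=True)
print(utf8_bytes.decode("utf-8"))
```

No manual decode/encode step is needed: the parser handles the input encoding and the serialiser handles the output encoding.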

edited Dec 19, 2017 at 11:09

answered Dec 19, 2017 at 11:03

deceze♦


2 Comments

rubmz Over a year ago

So if I have a byte array fetched from the network (RSS/HTML) with non-ASCII bytes in it, the function will handle them?

deceze Over a year ago

Yes. If you feed bytes (not str) to the parser, it will parse the XML encoding declaration and decode the bytes to str based on that. In the above example f.read() produces the bytes, but that value can come from anywhere.
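As a rough illustration of that comment, here is a small helper that takes a raw payload and refuses anything already decoded. The name parse_xml_payload and the sample payload are hypothetical; in practice the bytes might be the unmodified body of an HTTP response.

```python
from xml.etree import ElementTree


def parse_xml_payload(payload):
    """Parse raw XML bytes and let the parser honour the declared encoding."""
    if isinstance(payload, str):
        # A str has already been decoded, so its encoding declaration can no
        # longer be trusted; ask the caller for the original bytes instead.
        raise TypeError("expected bytes; pass the payload without decoding it")
    return ElementTree.fromstring(payload)


# Hypothetical payload, standing in for bytes fetched from the network.
expected_title = "ynet - חדשות"
payload = (
    "<?xml version='1.0' encoding='Windows-1255'?>"
    '<rss version="2.0"><channel><title>' + expected_title + "</title></channel></rss>"
).encode("cp1255")

root = parse_xml_payload(payload)
feed_title = root.find("channel/title").text
print(feed_title)
```

The explicit str check is a design choice for this sketch: it turns an easy mistake (decoding the response before parsing) into an immediate, readable error.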
