Parsing UTF-8 XML-Files with Python

Question

I've got an XML file which contains some German umlauts. My goal is to read the file and store the results in a database. For testing I have two different files. According to chardet, the first is UTF-8-SIG and the other is UTF-8.

After reading the file with lxml, I preprocess each value with unicode(field[0]).

Parsing the first file works fine, but processing the other results in an encoding error: UnicodeEncodeError: 'ascii' codec can't encode characters in position: ordinal not in range(128)

For example, such a string can be u'Zubeh\xf6r' (print(field[0])).

Using print(field[0].encode("utf-8")) prints the right string, but the type is str instead of unicode.

Take a look at this question: stackoverflow.com/questions/28852321/… – rafaelc, commented Aug 24, 2015 at 22:42

barlaso · Accepted Answer · 2015-08-25 03:41:25Z

Try

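from lxml import etree
# Tell the parser explicitly that the input is UTF-8.
# For XML input, etree.XMLParser(encoding='utf-8') is the analogous parser.
parser = etree.HTMLParser(encoding='utf-8')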

when you read the data with lxml.
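As a minimal sketch of the idea for XML input, the snippet below passes an explicit UTF-8 parser to etree.fromstring and strips a leading byte order mark first, since chardet reports UTF-8-SIG when the file starts with one. The helper name parse_utf8_xml and the sample document are made up for illustration, and the fallback to the standard library's xml.etree.ElementTree is only an assumption so that the sketch also runs without lxml installed.

```python
import codecs

try:
    from lxml import etree  # the library used in the question
except ImportError:
    # Assumption: fall back to the standard library if lxml is missing.
    import xml.etree.ElementTree as etree


def parse_utf8_xml(data):
    """Parse UTF-8 encoded XML bytes, with or without a leading BOM.

    Hypothetical helper, not part of lxml.
    """
    # A "UTF-8-SIG" file is plain UTF-8 preceded by this three-byte marker.
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    parser = etree.XMLParser(encoding="utf-8")
    return etree.fromstring(data, parser)


# Made-up sample document containing a German umlaut.
xml_text = (u'<?xml version="1.0" encoding="UTF-8"?>'
            u'<items><field>Zubeh\xf6r</field></items>')
plain = xml_text.encode("utf-8")
with_bom = codecs.BOM_UTF8 + plain

for raw in (plain, with_bom):
    root = parse_utf8_xml(raw)
    # The element text comes back as text, not as encoded bytes.
    assert root.find("field").text == u"Zubeh\xf6r"
```

Both variants of the sample parse to the same text, so the two test files from the question should behave identically once they reach the parser as bytes.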

3 Comments

jwacalex Over a year ago

The encoding is correct, but the type is still "str" instead of "unicode".

barlaso Over a year ago

When you 'encode' a Unicode string, you get a 'str'. What exactly are you trying to do? If you post some code, I can help you better.

jwacalex Over a year ago

I'm trying to get data from an XML file and store it in a UTF-8 database via the Django ORM.
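The point made in the comments, that encoding turns text into bytes and decoding turns bytes back into text, can be sketched as follows. The helper to_text is a hypothetical name, not a library function. The type names differ by version: in Python 2 the text type is unicode and the byte type is str, while in Python 3 they are str and bytes.

```python
def to_text(value, encoding="utf-8"):
    """Return text whether given encoded bytes or text (hypothetical helper)."""
    if isinstance(value, bytes):  # bytes is an alias of str in Python 2
        return value.decode(encoding)
    return value


text = u"Zubeh\xf6r"          # text: unicode in Python 2, str in Python 3
raw = text.encode("utf-8")    # bytes: str in Python 2, bytes in Python 3

# Encoding produces the UTF-8 byte sequence; decoding restores the text.
assert raw == b"Zubeh\xc3\xb6r"
assert to_text(raw) == text
assert to_text(text) == text

# The error from the question appears when text containing an umlaut
# is encoded with the ASCII codec, explicitly or implicitly.
try:
    text.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True
assert ascii_failed
```

Under this reading, the value to hand to the ORM is the text itself, and .encode("utf-8") is only needed where raw bytes are required, such as when writing to a byte stream.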
