I have an XML file which contains some German umlauts. My goal is to read the file and store the results in a database. For testing I have two different files. According to chardet, the first is UTF-8-SIG and the other one is UTF-8.

After reading the file with lxml, I preprocess the data with unicode(field[0]).

Parsing the first file works fine, but processing the other one raises an encoding error: UnicodeEncodeError: 'ascii' codec can't encode characters in position: ordinal not in range(128)

For example, one such string is u'Zubeh\xf6r' (from print(field[0])).

Using print(field[0].encode("utf-8")) prints the right string, but the type is str instead of unicode.

1 Answer

Try creating the parser with an explicit encoding when you read the data with lxml:

from lxml import etree
parser = etree.HTMLParser(encoding='utf-8')
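
As a rough sketch of how such a parser might be used, the snippet below parses a small in-memory document and reads one field. The sample XML, the element name field and the variable names are made up for illustration, since the real file is not shown in the question. It also swaps in etree.XMLParser rather than the HTMLParser from the answer, on the assumption that the input is well-formed XML.

```python
from lxml import etree

# Hypothetical sample data; the real file from the question is not shown.
xml_bytes = (
    u'<?xml version="1.0" encoding="UTF-8"?>'
    u'<items><field>Zubeh\xf6r</field></items>'
).encode('utf-8')

# Assumption: the input is XML, so XMLParser is used here instead of
# the HTMLParser suggested in the answer. Both accept encoding=...
parser = etree.XMLParser(encoding='utf-8')
root = etree.fromstring(xml_bytes, parser)

# findtext returns the text content of the first matching child element.
value = root.findtext('field')
```

Here value is already a text object for the non-ASCII content, so calling encode on it is only needed at the point where raw bytes are required.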

3 Comments

The encoding is correct, but the type is still "str" instead of "unicode".
When you 'encode' a Unicode string, you get a 'str'. What exactly are you trying to do? If you post some code, I can help you better.
I'm trying to get data from an XML file and store it in a UTF-8 database via the Django ORM.
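
The point made in the comments, that encoding a Unicode string yields a byte string, can be sketched as follows. The helper to_text is hypothetical and not part of lxml or Django; it only illustrates keeping values as text until the database layer does its own encoding. The snippet is written so that it runs on Python 2 (where bytes is an alias for str) as well as Python 3.

```python
text = u'Zubeh\xf6r'

# encode: text -> bytes (this is the 'str' type on Python 2)
encoded = text.encode('utf-8')

# decode: bytes -> text (the 'unicode' type on Python 2)
decoded = encoded.decode('utf-8')


def to_text(value, encoding='utf-8'):
    """Hypothetical helper: return value as text, decoding bytes if needed."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


# Both inputs end up as the same text object.
same_from_bytes = to_text(encoded)
same_from_text = to_text(text)
```

If the goal is to hand the value to an ORM field, passing the text form and skipping the explicit encode is usually the simpler route, assuming the database connection itself is configured for UTF-8.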
