4

It seems that I used the wrong function. With .fromstring() there are no error messages.

xml_ = load() # here comes the unicode string with Cyrillic letters 

print xml_    # prints everything fine 

print type(xml_) # 'lxml.etree._ElementUnicodeResult' = unicode 

xml = xml_.decode('utf-8') # here is an error

doc = lxml.etree.parse(xml) # if I do not decode it - the same error appears here

 File "testLog.py", line 48, in <module>
    xml = xml_.decode('utf-8')
  File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 89-96: ordinal not in range(128)

If

xml = xml_.encode('utf-8')

doc = lxml.etree.parse(xml) # here's an error

or

xml = xml_

then

UnicodeDecodeError: 'ascii' codec can't decode byte 0xd0 in position 89: ordinal not in range(128)

If I understand it right: I must decode non-ascii string into internal representation, then work with this representation and encode it back before sending to output? It seems that I do exactly this.

Input data must be in UTF-8 because of the 'Accept-Charset': 'utf-8' header.

3
  • Is the error still about character encoding on the etree.parse() call? What is the type of xml? etree.parse() does not parse XML held in str or unicode objects. Try etree.fromstring() instead. Commented Jul 8, 2012 at 18:06
  • @hasanyasin, it seems that you are right. :) Commented Jul 8, 2012 at 18:08
  • I will write a nice answer covering both problems, hoping that you will accept it as the correct answer. :) Commented Jul 8, 2012 at 18:09

5 Answers 5

6

In Python 2, str and unicode objects have different types and different representations of their content in memory. A unicode object is the decoded form of text, while a str is an encoded one (a sequence of bytes).

# -*- coding: utf-8 -*-

# Now, my string literals in this source file will
#    be str objects encoded in utf-8.

# In Python3, they will be unicode objects.
#    Below examples show the Python2 way.

s = 'ş'
print type(s) # prints <type 'str'>

u = s.decode('utf-8')
# Here, we create a unicode object from a string
#    which was encoded in utf-8.

print type(u) # prints <type 'unicode'>

As you see,

.encode() --> str
.decode() --> unicode

When we encode to or decode from strings, we need to be sure that the text can be represented in the source or target encoding, and that we decode with the same encoding that was used to encode. A string encoded in iso-8859-1 cannot be decoded correctly as iso-8859-9.
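To make the point about matching encodings concrete, here is a minimal sketch. The sample character is the same 'ş' as above, written as an escape so the result does not depend on the source file encoding. The variable names are made up for illustration.

```python
# -*- coding: utf-8 -*-
text = u'\u015f'   # LATIN SMALL LETTER S WITH CEDILLA

# Encoding turns text into bytes, decoding with the same codec restores it.
utf8_bytes = text.encode('utf-8')
round_trip = utf8_bytes.decode('utf-8')

# iso-8859-9 (Turkish) can represent this character, iso-8859-1 cannot.
latin5_bytes = text.encode('iso-8859-9')
try:
    text.encode('iso-8859-1')
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False

# Decoding with the wrong codec raises no error here.
# It silently yields a different character (U+00FE instead of U+015F).
wrong = latin5_bytes.decode('iso-8859-1')
```

The last line is the dangerous case: nothing fails, the text is just wrong.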

As for the second error reported in the question, lxml.etree.parse() expects a file name or a file-like object. To parse XML held in a string, lxml.etree.fromstring() should be used.
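A rough sketch of the difference between the two calls. To keep it runnable without extra packages it uses the standard library's xml.etree.ElementTree as a stand-in (that swap is my assumption for the example). lxml.etree provides parse() and fromstring() under the same names. The sample document is made up.

```python
import io
import xml.etree.ElementTree as ET   # stand-in for lxml.etree in this sketch

# Unicode text containing a Cyrillic letter (hypothetical sample document).
xml_text = u'<root><item>\u0416</item></root>'
xml_bytes = xml_text.encode('utf-8')

# fromstring() takes the XML document itself.
root_from_text = ET.fromstring(xml_bytes)

# parse() takes a file name or a file-like object, so the bytes are wrapped.
tree = ET.parse(io.BytesIO(xml_bytes))
root_from_file = tree.getroot()

item_text = root_from_text.find('item').text
```

Passing the XML text directly to parse() makes it treat the text as a file name, which is why the call in the question fails.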


2

If your original string is unicode, it only makes sense to encode it to utf-8, not to decode it from utf-8.

I think the XML parser can only handle XML that is ASCII.

So use xml = xml_.encode('ascii','xmlcharrefreplace') to convert the unicode characters that are not in ASCII to XML entities.
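A small sketch of what that call produces, assuming a one-letter sample string. The parsing step uses the standard library's xml.etree.ElementTree as a stand-in for lxml, only to show that a parser turns the character reference back into the original letter.

```python
import xml.etree.ElementTree as ET   # stand-in parser for this sketch

text = u'\u0416'   # CYRILLIC CAPITAL LETTER ZHE

# Characters outside ASCII become numeric XML character references.
ascii_bytes = text.encode('ascii', 'xmlcharrefreplace')

# The resulting document is pure ASCII and still parses to the same text.
doc = u'<root>' + text + u'</root>'
root = ET.fromstring(doc.encode('ascii', 'xmlcharrefreplace'))
parsed_text = root.text
```

Note that this only gives sensible results for text content. The error handler escapes any non-ASCII character, wherever it appears in the document.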

1 Comment

@hasanyasin : I encode the unicode string to bytes in the ascii encoding. This is very well possible. The Cyrillic strings are translated into xml entities. e.g. Ж becomes &#1046;.
1

I assume that you are trying to parse some web site?

Did you validate that the website is correct? Maybe they have the encoding incorrect?

Many websites are broken and rely on web browsers having very robust parsers. You could give BeautifulSoup a try; it is also very robust.

There is a de-facto web standard that the charset given in the HTTP Content-Type header (which may involve negotiation and relates to the Accept-Charset header you mention) is overruled by any <meta http-equiv=... tag in the HTML file!

So you might just not have a UTF-8 input!
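One cheap way to check this is to try decoding the raw bytes strictly. A minimal sketch follows. The helper name decode_utf8_or_none is made up, and cp1251 is only my guess at a plausible alternative encoding for Cyrillic text.

```python
def decode_utf8_or_none(raw_bytes):
    """Return the decoded text, or None if the bytes are not valid UTF-8."""
    try:
        return raw_bytes.decode('utf-8')
    except UnicodeDecodeError:
        return None

# The same Cyrillic letter in two encodings (the second one is an assumption).
utf8_input = u'\u0416'.encode('utf-8')
cp1251_input = u'\u0416'.encode('cp1251')

from_utf8 = decode_utf8_or_none(utf8_input)
from_cp1251 = decode_utf8_or_none(cp1251_input)
```

A failed decode proves the input is not UTF-8. A successful one is only a strong hint, since some byte sequences are valid in several encodings.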


1

The lxml library already gives you unicode objects. You're running into Python 2's automatic unicode/bytes conversion. The hint is that you ask it to decode but get an Encode error: calling .decode() on a unicode object first implicitly encodes it to bytes with the default codec (ascii), and that step fails on the Cyrillic characters before any UTF-8 decoding happens.

Use the .encode method on unicode objects to convert to bytes (str type).
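A sketch that spells out the hidden step. It does not call .decode() on the text, so it runs on Python 3 as well. Instead it performs by hand the ASCII encode that Python 2 does implicitly. The sample letter is made up.

```python
text = u'\u0416'   # already decoded text, like the value lxml returns

# Python 2 evaluates text.decode('utf-8') roughly as
# text.encode('ascii').decode('utf-8'). The first step is the one that fails.
try:
    text.encode('ascii')
    error_name = None
except UnicodeEncodeError as exc:
    error_name = type(exc).__name__

# The explicit conversion to bytes names the codec and works.
data = text.encode('utf-8')
```

The first byte of data is 0xd0, the same byte the UnicodeDecodeError in the question complains about.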

Watching this will teach you a lot about how to solve this problem: http://nedbatchelder.com/text/unipain.html


1

For me, using the .fromstring() method was what I needed.
