Python: I use .decode() - 'ascii' codec can't encode

Question

It seems I have been using the wrong function. With .fromstring() there are no error messages.

xml_ = load() # here comes the unicode string with Cyrillic letters 

print xml_    # prints everything fine 

print type(xml_) # 'lxml.etree._ElementUnicodeResult' = unicode 

xml = xml_.decode('utf-8') # here is an error

doc = lxml.etree.parse(xml) # if I do not decode it - the same error appears here

 File "testLog.py", line 48, in <module>
    xml = xml_.decode('utf-8')
  File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 89-96: ordinal not in range(128)

If

xml = xml_.encode('utf-8')

doc = lxml.etree.parse(xml) # here's an error

or

xml = xml_

then

UnicodeDecodeError: 'ascii' codec can't decode byte 0xd0 in position 89: ordinal not in range(128)

If I understand correctly, I must decode the non-ASCII string into the internal representation, work with that representation, and encode it back before sending it to output. It seems that I am doing exactly this.

Input data must be in UTF-8 because of the 'Accept-Charset': 'utf-8' header.

Is the error still about character encoding on the etree.parse() call? What is the type of xml? etree.parse() does not work on strings or unicode objects. Try etree.fromstring() instead.
– hasanyasin, Commented Jul 8, 2012 at 18:06
I will write a nice answer covering both problems, hoping that you will accept it as the correct answer. :)
– hasanyasin, Commented Jul 8, 2012 at 18:09

hasanyasin · Accepted Answer · 2012-07-08 18:21:17Z

In Python 2, str and unicode objects have different types and different in-memory representations of their content. A unicode object is the decoded form of text, while a str is an encoded one.

# -*- coding: utf-8 -*-

# Now, my string literals in this source file will
#    be str objects encoded in utf-8.

# In Python3, they will be unicode objects.
#    Below examples show the Python2 way.

s = 'ş'
print type(s) # prints <type 'str'>

u = s.decode('utf-8')
# Here, we create a unicode object from a string
#    which was encoded in utf-8.

print type(u) # prints <type 'unicode'>

As you see,

.encode() --> str
.decode() --> unicode

When we encode to or decode from strings, we need to be sure that the text can be represented in the source or target encoding, and that we decode with the same encoding that was used to encode. An iso-8859-1 encoded string cannot be decoded correctly with iso-8859-9.
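A minimal sketch of such a mismatch, written in Python 3 syntax (where text is str and encoded data is bytes) so that it runs on a current interpreter. The variable names are illustrative only.

```python
# Text (Python 3 str, Python 2 unicode) is encoded to bytes and decoded back.
text = 'ş'

raw = text.encode('iso-8859-9')    # a single byte in the Turkish Latin-5 charset
same = raw.decode('iso-8859-9')    # round-trips to the original text
wrong = raw.decode('iso-8859-1')   # no exception, but a different character

# Text that the target encoding cannot represent fails at encode time.
try:
    'Ж'.encode('iso-8859-1')
    encodable = True
except UnicodeEncodeError:
    encodable = False

print(same == text, wrong == text, encodable)
```

Decoding with the wrong single-byte codec usually does not raise anything, which is why this kind of bug tends to show up as garbled characters rather than as an exception.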

As for the second error in the question, lxml.etree.parse() works on file-like objects (or file names). To parse from a string, use lxml.etree.fromstring().
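The difference between the two calls can be sketched as follows. To keep the example runnable without extra packages it uses the standard library's xml.etree.ElementTree, on the assumption that its parse() and fromstring() behave like the lxml.etree functions of the same name for this purpose. The sample document is made up.

```python
import io
import xml.etree.ElementTree as etree  # stand-in for lxml.etree (assumption)

# Hypothetical sample data: UTF-8 encoded XML with Cyrillic content.
xml_bytes = '<doc><name>Жук</name></doc>'.encode('utf-8')

# fromstring() parses XML held in memory and returns the root element.
root = etree.fromstring(xml_bytes)

# parse() expects a file name or a file-like object, so wrap the bytes.
tree = etree.parse(io.BytesIO(xml_bytes))

print(root.find('name').text == tree.getroot().find('name').text)
```

Passing the XML text itself to parse() makes the library treat it as a file name, which is one way to end up with confusing errors like the ones in the question.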

Marco de Wit · Accepted Answer · 2012-07-08 18:11:10Z

2

If your original string is unicode, it only makes sense to encode it to utf-8, not decode it from utf-8.

I think the XML parser can only handle XML that is ASCII.

So use xml = xml_.encode('ascii', 'xmlcharrefreplace') to convert the unicode characters that are not in ASCII to XML entities.
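A small sketch of what that call produces, in Python 3 syntax. The sample element is made up, and the standard library's xml.etree.ElementTree is used here only to show that a parser turns the reference back into the original character.

```python
import xml.etree.ElementTree as etree  # stdlib parser, used for illustration

text = '<name>Ж</name>'  # hypothetical sample with one Cyrillic letter

# Characters outside ASCII become numeric character references.
ascii_xml = text.encode('ascii', 'xmlcharrefreplace')
print(ascii_xml)

# The parser resolves the reference back to the original character.
restored = etree.fromstring(ascii_xml).text
print(restored == 'Ж')
```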

edited Jul 8, 2012 at 18:11

answered Jul 8, 2012 at 17:56

Marco de Wit


@hasanyasin: I encode the unicode string to bytes in the ASCII encoding. This is perfectly possible. The Cyrillic characters are translated into XML entities, e.g. Ж becomes &#1046;.
– Marco de Wit, Commented over a year ago

Has QUIT--Anony-Mousse · Accepted Answer · 2012-07-08 18:06:54Z

1

I assume that you are trying to parse some web site?

Did you validate that the website is correct? Maybe they have the encoding incorrect?

Many websites are broken and rely on web browsers having very robust parsers. You could give BeautifulSoup a try; it is also very robust.

There is a de facto web standard that the charset given in the HTTP Content-Type header (which may involve negotiation and relates to the Accept-Charset header you mention) is overruled by any <meta http-equiv=... tag in the HTML file!

So you might just not have a UTF-8 input!

answered Jul 8, 2012 at 18:06

Has QUIT--Anony-Mousse


Voo · Accepted Answer · 2012-07-08 18:13:21Z

1

The lxml library already gives you a unicode object. You're running into Python 2's automatic unicode/bytes conversion. The hint is that you asked it to decode but got an Encode error: calling .decode() on a unicode object makes Python first encode it to bytes with the default encoding (ascii), and only then decode those bytes back to unicode.

Use the .encode method on unicode objects to convert to bytes (str type).
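A rough emulation of that implicit step, written for Python 3 (where str has no .decode() method at all). The helper py2_style_decode is hypothetical and only mimics the behaviour described in this answer; it is not how Python 2 is implemented.

```python
def py2_style_decode(text, encoding):
    """Mimic Python 2's unicode.decode(): implicit ASCII encode, then decode."""
    raw = text.encode('ascii')   # the hidden step that fails for non-ASCII text
    return raw.decode(encoding)

# Pure ASCII text survives the hidden step, so the problem goes unnoticed.
print(py2_style_decode('plain ascii', 'utf-8'))

# Cyrillic text fails in the hidden encode, hence an *Encode* error on decode.
try:
    py2_style_decode('Кириллица', 'utf-8')
    error_name = None
except UnicodeEncodeError as exc:
    error_name = type(exc).__name__
print(error_name)

# What the question needed instead: encode the text to get UTF-8 bytes.
utf8_bytes = 'Кириллица'.encode('utf-8')
```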

Watching this will teach you a lot about how to solve this problem: http://nedbatchelder.com/text/unipain.html

edited Jul 8, 2012 at 18:13

Voo


answered Jul 8, 2012 at 17:58

Daenyth


Ben Usman · Accepted Answer · 2017-05-04 18:34:15Z

1

For me, using the .fromstring() method was what was needed.

edited May 4, 2017 at 18:34

answered Mar 18, 2014 at 20:15

Ben Usman
