>>> teststring = 'aõ'
>>> type(teststring)
<type 'str'>
>>> teststring
'a\xf5'
>>> print teststring
aõ
>>> teststring.decode("ascii", "ignore")
u'a'
>>> teststring.decode("ascii", "ignore").encode("ascii")
'a'

which is what I really wanted it to store internally, since I am removing the non-ASCII characters. Why did the decode("ascii") give out a unicode string?

>>> teststringUni = u'aõ'
>>> type(teststringUni)
<type 'unicode'>
>>> print teststringUni
aõ
>>> teststringUni.decode("ascii" , "ignore")

Traceback (most recent call last):
  File "<pyshell#79>", line 1, in <module>
    teststringUni.decode("ascii" , "ignore")
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf5' in position 1: ordinal not in range(128)
>>> teststringUni.decode("utf-8" , "ignore")

Traceback (most recent call last):
  File "<pyshell#81>", line 1, in <module>
    teststringUni.decode("utf-8" , "ignore")
  File "C:\Python27\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf5' in position 1: ordinal not in range(128)
>>> teststringUni.encode("ascii" , "ignore")
'a'

Which is again what I wanted. I don't understand this behavior. Can someone explain to me what is happening here?

Edit: I thought this would help me understand things so I could solve the real problem in my program, which I describe here: Converting Unicode objects with non-ASCII symbols in them into strings objects (in Python)

2 Answers


It's simple: .encode converts Unicode objects into strings, and .decode converts strings into Unicode.
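A minimal Python 2 sketch of that round trip (the Latin-1 codec here is just an assumption that happens to match the 'a\xf5' bytes in the question):

>>> byte_string = 'a\xf5'                           # str: raw bytes
>>> unicode_string = byte_string.decode('latin-1')  # decode: str -> unicode
>>> type(unicode_string)
<type 'unicode'>
>>> unicode_string.encode('latin-1')                # encode: unicode -> str
'a\xf5'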

1 Comment

If this does not work, also try BeautifulSoup(html).encode for HTML, or the regex module.

Why did the decode("ascii") give out a unicode string?

Because that's what decode is for: it decodes byte strings like your ASCII one into unicode.

In your second example, you're trying to "decode" a string which is already unicode, which has no effect. To print it to your terminal, though, Python must encode it with your default encoding, which is ASCII; because you haven't done that step explicitly, and therefore haven't specified the 'ignore' parameter for it, it raises an error saying it can't encode the non-ASCII character.
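If you do that encode step yourself, you get to choose the error handler; a minimal Python 2 sketch, reusing teststringUni from the question and assuming a stock interpreter whose default codec is ASCII:

>>> import sys
>>> sys.getdefaultencoding()                        # the codec used when none is given explicitly
'ascii'
>>> teststringUni.encode(sys.getdefaultencoding(), 'ignore')  # explicit encode; non-ASCII is dropped
'a'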

The trick to all of this is remembering that decode takes an encoded bytestring and converts it to Unicode, and encode does the reverse. It might be easier if you understand that Unicode is not an encoding.
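For example, the same Unicode text produces different byte strings under different encodings, and decoding each one gets back the same Unicode object (a small sketch; UTF-8 and Latin-1 are just two illustrative choices):

>>> u = u'a\xf5'
>>> u.encode('utf-8')        # one byte representation of the text
'a\xc3\xb5'
>>> u.encode('latin-1')      # a different byte representation of the same text
'a\xf5'
>>> u.encode('utf-8').decode('utf-8') == u.encode('latin-1').decode('latin-1')
True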

2 Comments

Well, you are right, except for some details. Since he can print 'a\xf5' correctly, his terminal's encoding is not ASCII but something else. The console encoding is a really common problem, but it's not the cause this time. Also, teststringUni.decode("ascii", "ignore") does not fail when you try to print the result. It tells Python that teststringUni is an ASCII-encoded string (it is clearly unicode, but Python trusts the user) and tries to decode it, which of course cannot work.
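(A small sketch of that hidden step: in Python 2, calling .decode on a unicode object only succeeds if the implicit encode with the default ASCII codec succeeds first.)

>>> u'abc'.decode('ascii')              # works only because the implicit ASCII encode of u'abc' succeeds
u'abc'
>>> u'a\xf5'.decode('ascii', 'ignore')  # fails during that implicit encode, before 'ignore' is ever applied
Traceback (most recent call last):
  ...
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf5' in position 1: ordinal not in range(128)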
Yes, I think that is the problem: what is my terminal encoding? Just because an object's type is string does not mean the encoding is ASCII; I understood that. My problem now is to figure out how I can translate something that has type unicode into the string type of the terminal, while retaining all information.
