>>> teststring = 'aõ'
>>> type(teststring)
<type 'str'>
>>> teststring
'a\xf5'
>>> print teststring
aõ
>>> teststring.decode("ascii", "ignore")
u'a'
>>> teststring.decode("ascii", "ignore").encode("ascii")
'a'

which is what I really wanted it to store internally, since I am removing the non-ASCII characters. Why did the decode("ascii") give out a unicode string?

>>> teststringUni = u'aõ'
>>> type(teststringUni)
<type 'unicode'>
>>> print teststringUni
aõ
>>> teststringUni.decode("ascii" , "ignore")

Traceback (most recent call last):
  File "<pyshell#79>", line 1, in <module>
    teststringUni.decode("ascii" , "ignore")
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf5' in position 1: ordinal not in range(128)
>>> teststringUni.decode("utf-8" , "ignore")

Traceback (most recent call last):
  File "<pyshell#81>", line 1, in <module>
    teststringUni.decode("utf-8" , "ignore")
  File "C:\Python27\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf5' in position 1: ordinal not in range(128)
>>> teststringUni.encode("ascii" , "ignore")
'a'

Which is again what I wanted. I don't understand this behavior. Can someone explain to me what is happening here?

Edit: I thought this would help me understand things so I could solve the real problem in my program, which I describe here: Converting Unicode objects with non-ASCII symbols in them into strings objects (in Python)

2 Answers


It's simple: .encode converts Unicode objects into strings, and .decode converts strings into Unicode.
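A minimal Python 2 sketch of that round trip (the Latin-1 codec here is just an assumption that happens to match the 'a\xf5' bytes in the question):

>>> byte_string = 'a\xf5'                           # str: raw bytes
>>> unicode_string = byte_string.decode('latin-1')  # decode: str -> unicode
>>> type(unicode_string)
<type 'unicode'>
>>> unicode_string.encode('latin-1')                # encode: unicode -> str
'a\xf5'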

1 Comment

If this does not work, also try BeautifulSoup(html).encode for HTML, or the regex module.

Why did the decode("ascii") give out a unicode string?

Because that's what decode is for: it decodes byte strings like your ASCII one into unicode.

In your second example, you're trying to "decode" a string which is already unicode, which has no effect. To print it to your terminal, though, Python must encode it with your default encoding, which is ASCII; because you haven't done that step explicitly, and therefore haven't specified the 'ignore' parameter for it, it raises an error saying it can't encode the non-ASCII character.
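If you do that encode step yourself, you get to choose the error handler; a minimal Python 2 sketch, reusing teststringUni from the question and assuming a stock interpreter whose default codec is ASCII:

>>> import sys
>>> sys.getdefaultencoding()                        # the codec used when none is given explicitly
'ascii'
>>> teststringUni.encode(sys.getdefaultencoding(), 'ignore')  # explicit encode; non-ASCII is dropped
'a'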

The trick to all of this is remembering that decode takes an encoded bytestring and converts it to Unicode, and encode does the reverse. It might be easier if you understand that Unicode is not an encoding.
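For example, the same Unicode text produces different byte strings under different encodings, and decoding each one gets back the same Unicode object (a small sketch; UTF-8 and Latin-1 are just two illustrative choices):

>>> u = u'a\xf5'
>>> u.encode('utf-8')        # one byte representation of the text
'a\xc3\xb5'
>>> u.encode('latin-1')      # a different byte representation of the same text
'a\xf5'
>>> u.encode('utf-8').decode('utf-8') == u.encode('latin-1').decode('latin-1')
True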

2 Comments

Well, you are right, except for some details. Since he can print 'a\xf5' correctly, his terminal's encoding is not ASCII but something else. The console encoding is a really common problem, but it's not the cause this time. Also, teststringUni.decode("ascii", "ignore") does not fail when you try to print the result. It tells Python that teststringUni is an ASCII-encoded string (it is clearly unicode, but Python trusts the user) and tries to decode it, which of course cannot work.
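(A small sketch of that hidden step: in Python 2, calling .decode on a unicode object only succeeds if the implicit encode with the default ASCII codec succeeds first.)

>>> u'abc'.decode('ascii')              # works only because the implicit ASCII encode of u'abc' succeeds
u'abc'
>>> u'a\xf5'.decode('ascii', 'ignore')  # fails during that implicit encode, before 'ignore' is ever applied
Traceback (most recent call last):
  ...
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf5' in position 1: ordinal not in range(128)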
Yes, I think that is the problem: what is my terminal encoding? Just because an object's type is string does not mean the encoding is ASCII; I understood that. My problem now is to figure out how I can translate something that has type unicode into the string type of the terminal, while retaining all information.
