How to replace characters that cannot be decoded using utf8 with whitespace?

# -*- coding: utf-8 -*-
print unicode('\x97', errors='ignore') # print out nothing
print unicode('ABC\x97abc', errors='ignore') # print out ABCabc

How can I print out ABC abc instead of ABCabc? Note, \x97 is just an example character. The characters that cannot be decoded are unknown inputs.

  • With errors='ignore', the undecodable character is dropped, so nothing is printed in its place.
  • With errors='replace', the character is replaced with the Unicode replacement character (U+FFFD) rather than with whitespace.

2 Answers

Take a look at codecs.register_error. You can use it to register custom error handlers:

https://docs.python.org/2/library/codecs.html#codecs.register_error

import codecs

# The handler returns (replacement text, position at which decoding resumes).
codecs.register_error('replace_with_space', lambda e: (u' ', e.start + 1))
print unicode('ABC\x97abc', encoding='utf-8', errors='replace_with_space')  # prints ABC abc
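
For readers on Python 3, the same idea can be sketched as follows. This is a minimal sketch, not part of the original answer: it assumes the input is a bytes object, and it resumes decoding at error.end instead of e.start + 1, so a multi-byte undecodable span collapses into a single space. The handler name replace_with_space is reused from the snippet above.

```python
import codecs


def replace_with_space(error):
    """Error handler that substitutes one space for each undecodable span."""
    # Only decoding errors are handled here; anything else is re-raised.
    if not isinstance(error, UnicodeDecodeError):
        raise error
    # Return the replacement text and the position where decoding resumes.
    return (" ", error.end)


codecs.register_error("replace_with_space", replace_with_space)

decoded = b"ABC\x97abc".decode("utf-8", errors="replace_with_space")
print(decoded)  # ABC abc
```

Because the handler is registered by name, it can be passed as the errors argument to any later decode call in the same process.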

1 Comment

Does Stack Overflow allow more than one accepted solution? Both @Kasramvd and you provide excellent answers. What should I do in this case?

You can use a try-except statement to handle the UnicodeDecodeError:

def my_encoder(my_string):
    # Decode one byte at a time with the default (ASCII) codec.
    for i in my_string:
        try:
            yield unicode(i)
        except UnicodeDecodeError:
            yield u'\t'  # or any other whitespace character

Then use the str.join method to join the pieces into a single string:

print ''.join(my_encoder(my_string))

Demo:

>>> print ''.join(my_encoder('this is a\x97n exam\x97ple'))
this is a   n exam  ple
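
Note that the generator above decodes one byte at a time with the default ASCII codec, so valid multi-byte UTF-8 characters would also be turned into whitespace. A possible Python 3 variant of the same try-except idea that keeps valid UTF-8 intact is sketched below. The function name decode_with_spaces and its filler parameter are hypothetical, introduced only for illustration, and the input is assumed to be a bytes object.

```python
def decode_with_spaces(data, encoding="utf-8", filler=" "):
    """Decode bytes, substituting `filler` for each undecodable span."""
    pieces = []
    position = 0
    while position < len(data):
        try:
            # Try to decode everything that is left in one go.
            pieces.append(data[position:].decode(encoding))
            break
        except UnicodeDecodeError as error:
            # error.start and error.end are relative to the slice just tried.
            good_part = data[position:position + error.start]
            pieces.append(good_part.decode(encoding))
            pieces.append(filler)
            # Skip past the undecodable span and continue with the remainder.
            position += error.end
    return "".join(pieces)


print(decode_with_spaces(b"this is a\x97n exam\x97ple"))
```

Passing filler="\t" would use tabs as in the demo above, while the default inserts a single space.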

4 Comments

\x97 is just an example character. The characters that cannot be decoded are unknown inputs.
@DehengYe Just a typo, fixed
very helpful answer! @Kasramvd
I hope you don't mind. Both you and @HelloWorld provide excellent answers. But Stack Overflow allows only one solution.
