
I have millions of strings scraped from the web, like:

s = 'WHAT\xe2\x80\x99S UP DOC?'
type(s) == str # returns True

Special characters like those in the string above are inevitable when scraping the web. How should one remove all such special characters to retain just clean text? I am thinking of a regular expression like this, based on my very limited experience with Unicode characters:

\\x.*[0-9]
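A quick sanity check of the idea (Python 3 here, purely illustrative): the `\xe2` etc. in the literal are single characters, not literal backslash-x text, so a regex looking for `\\x` finds nothing to match.

```python
import re

s = 'WHAT\xe2\x80\x99S UP DOC?'

# The escapes in the literal produce single characters; there is no
# literal backslash in s, so this pattern can never match.
match = re.search(r'\\x.*[0-9]', s)
print(match)  # → None
```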
  • They are not special characters; that is a UTF-8 encoded string, which when printed will output WHAT’S UP DOC? Commented Aug 18, 2015 at 19:09
  • stackoverflow.com/questions/5843518/… Commented Aug 18, 2015 at 19:09
  • So you want any non-ASCII removed? i.e. print(s.decode("ascii", errors="ignore")) Commented Aug 18, 2015 at 19:12
  • Works like butter. Thanks!! Commented Aug 18, 2015 at 19:13
  • @mousecoder, work away Commented Aug 18, 2015 at 19:17

2 Answers


The special characters are not actually multiple characters long; that is just how they are represented, so your regex isn't going to work. If you print the string you will see the actual Unicode (UTF-8) characters:

>>> s = 'WHAT\xe2\x80\x99S UP DOC?'
>>> print(s)
WHATâS UP DOC?
>>> repr(s)
"'WHATâ\\x80\\x99S UP DOC?'"

If you want to keep only the ASCII characters, you can check whether each character is in string.printable:

>>> import string
>>> ''.join(i for i in s if i in string.printable)
'WHATS UP DOC?'
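With millions of strings, a likely faster equivalent (sketch, assuming Python 3 and that the text is already a decoded str) is to round-trip through ASCII with errors='ignore', which drops every non-ASCII character in one pass:

```python
s = 'WHAT\xe2\x80\x99S UP DOC?'

# Encoding to ASCII with errors='ignore' silently drops the
# characters that can't be represented, then we decode back to str.
clean = s.encode('ascii', errors='ignore').decode('ascii')
print(clean)  # → WHATS UP DOC?
```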



This worked for me, as mentioned by Padriac in the comments:

s.decode('ascii', errors='ignore')
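Note that str.decode only exists in Python 2, where str is a byte string. In Python 3 the scraped data would arrive as bytes, and the same one-liner applies to the bytes object instead (a sketch under that assumption):

```python
# Python 3: str has no .decode(); the raw scraped data would be bytes.
raw = b'WHAT\xe2\x80\x99S UP DOC?'

# Invalid ASCII bytes (0xe2, 0x80, 0x99) are dropped by errors='ignore'.
clean = raw.decode('ascii', errors='ignore')
print(clean)  # → WHATS UP DOC?
```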
