
I have millions of strings scraped from the web, like:

s = 'WHAT\xe2\x80\x99S UP DOC?'
type(s) == str # returns True

Special characters like those in the string above are inevitable when scraping the web. How should one remove all such special characters to retain just clean text? I am thinking of a regular expression like this, based on my very limited experience with Unicode characters:

\\x.*[0-9]
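A quick sanity check of the idea (Python 3 here, purely illustrative): the `\xe2` etc. in the literal are single characters, not literal backslash-x text, so a regex looking for `\\x` finds nothing to match.

```python
import re

s = 'WHAT\xe2\x80\x99S UP DOC?'

# The escapes in the literal produce single characters; there is no
# literal backslash in s, so this pattern can never match.
match = re.search(r'\\x.*[0-9]', s)
print(match)  # → None
```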
  • They are not special characters; that is a UTF-8 encoded string, which when printed will output WHAT’S UP DOC? Commented Aug 18, 2015 at 19:09
  • stackoverflow.com/questions/5843518/… Commented Aug 18, 2015 at 19:09
  • So you want any non-ASCII removed? i.e. print(s.decode("ascii", errors="ignore")) Commented Aug 18, 2015 at 19:12
  • Works like butter. Thanks!! Commented Aug 18, 2015 at 19:13
  • @mousecoder, work away Commented Aug 18, 2015 at 19:17

2 Answers


The special characters are not actually multiple characters long; that is just how they are represented, so your regex isn't going to work. If you print the string you will see the actual Unicode (UTF-8) characters:

>>> s = 'WHAT\xe2\x80\x99S UP DOC?'
>>> print(s)
WHATâS UP DOC?
>>> repr(s)
"'WHATâ\\x80\\x99S UP DOC?'"

If you want to keep only the ASCII characters, you can check whether each character is in string.printable:

>>> import string
>>> ''.join(i for i in s if i in string.printable)
'WHATS UP DOC?'
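With millions of strings, a likely faster equivalent (sketch, assuming Python 3 and that the text is already a decoded str) is to round-trip through ASCII with errors='ignore', which drops every non-ASCII character in one pass:

```python
s = 'WHAT\xe2\x80\x99S UP DOC?'

# Encoding to ASCII with errors='ignore' silently drops the
# characters that can't be represented, then we decode back to str.
clean = s.encode('ascii', errors='ignore').decode('ascii')
print(clean)  # → WHATS UP DOC?
```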



This worked for me, as mentioned by Padriac in the comments:

s.decode('ascii', errors='ignore')
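Note that str.decode only exists in Python 2, where str is a byte string. In Python 3 the scraped data would arrive as bytes, and the same one-liner applies to the bytes object instead (a sketch under that assumption):

```python
# Python 3: str has no .decode(); the raw scraped data would be bytes.
raw = b'WHAT\xe2\x80\x99S UP DOC?'

# Invalid ASCII bytes (0xe2, 0x80, 0x99) are dropped by errors='ignore'.
clean = raw.decode('ascii', errors='ignore')
print(clean)  # → WHATS UP DOC?
```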
