
I'm trying to remove everything from a string except lowercase alphanumerics and whitespace.

The problem is when I use a unicode character like this:

re.sub(r'[^a-å_\s]', '', '¤☃')

It doesn't get removed. Why is this, and what can I do about it?

  • What is the expected output from that input? Commented Nov 18, 2016 at 19:38
  • Nothing. If the input was 'a¤b☃c', I'd want the output to be 'abc', since I'm trying to remove everything but lowercase alphanumerics and whitespace. Commented Nov 18, 2016 at 19:40

3 Answers


You can use Unicode.

>>> re.sub(ur'[^a-å_\s]', u'', u'¤☃')
u'\xa4'
>>> print re.sub(ur'[^a-å_\s]', u'', u'¤☃')
¤
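
If the goal is to have ¤ removed as well, one option (just a sketch, assuming Python 2 and a UTF-8 terminal as in the session above) is to list the letters you want explicitly instead of relying on the a-å codepoint range, since that range (U+0061 through U+00E5) happens to contain ¤ at U+00A4:

>>> print re.sub(ur'[^a-zæøå_\s]', u'', u'a¤b☃c')
abc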

1 Comment

When I do this it just returns the same as above. I want it to exclude those characters, but it's not happening.

You can remove any non-ASCII character like this:

>>> import re
>>> 
>>> print re.sub(ur'[^\x00-\x7F]', u'', u'123aąść1b2d3')
123a1b2d3

If you want to preserve some additional non-ASCII characters, just add them to the regexp.

print re.sub(ur'[^\x00-\x7Fæøø]', u'', u'123aąść1b2d3æøø')
123a1b2d3æøø
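
If the goal is to keep only lowercase letters, digits, whitespace and the Danish letters æ, ø, å while also dropping ASCII symbols such as # @ &, a whitelist-style class may be closer to what's wanted (a sketch, assuming unicode input as above):

>>> print re.sub(ur'[^a-z0-9æøå\s]', u'', u'a1#b¤cæøåd☃')
a1bcæøåd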

4 Comments

This works, except I also want the Danish characters 'æ', 'ø', and 'å' to be kept. What can I do to keep those?
This also doesn't get rid of all symbols; stuff like # @ & still gets through.
Because those are ASCII characters. Try something like this: [^\x41-\x7A]. You can find an ASCII table here: asciitable.com
Then just specify what you want to preserve: [^A-z0-9]
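
One caveat about the ranges suggested above: A-z and \x41-\x7A both span codepoints 0x41-0x7A, which also includes the punctuation characters [ \ ] ^ _ ` sitting between 'Z' and 'a', so a tighter whitelist such as [^a-z0-9\s] is usually safer:

>>> print re.sub(ur'[^A-z0-9]', u'', u'ab[c]_d#e@f')
ab[c]_def
>>> print re.sub(ur'[^a-z0-9\s]', u'', u'ab[c]_d#e@f')
abcdef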

Others have already explained that you need a unicode regex with unicode arguments to work with unicode properly. Python is likely storing '¤☃' in an encoded form, often UTF-8, which would store your input as '\xc2\xa4\xe2\x98\x83', and the regex itself would be '[^a-\xc3\xa5_\\s]'. That means your character class is excluding whitespace and ordinals from 97 to 195 (plus explicitly excluding 165, but that's already covered by the range), not ordinals 97 to 229 as you expected. And since the UTF-8 encoded input is represented by bytes in that range (aside from the \xe2 byte, which gets dropped), your output is only lightly filtered.
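
To make the byte-level behaviour concrete, here's roughly what that call does, assuming the source/terminal encoding really is UTF-8 (Python 2):

>>> '¤☃'                            # actually a UTF-8 byte string
'\xc2\xa4\xe2\x98\x83'
>>> re.sub(r'[^a-å_\s]', '', '¤☃')  # byte-wise, the class is [^a-\xc3\xa5_\s]
'\xc2\xa4\x98\x83'

Only \xe2 (226) falls outside the 97-195 byte range, so it is the only byte removed; the rest, including the two bytes that encode ¤, survive.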

Even if you switch to using unicode properly, ord(u'¤') is 164, while ord(u'å') is 229; it correctly preserves ¤ because it's in the character class you've excluded from substitution.
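
For reference, the codepoints involved:

>>> [ord(c) for c in u'a¤å☃']
[97, 164, 229, 9731]

¤ (164) sits inside the a-å range (97-229), while ☃ (9731) does not, which is why only the snowman gets substituted away.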

You shouldn't be using regular expressions here, because it's not practical to exhaustively define all alphabetic and whitespace characters scattered across the Unicode range while excluding all the others. Instead, use the string methods that actually consult the Unicode database to inspect character properties:

>>> u''.join(x for x in u'a¤ ☃b' if x.isspace() or x.islower())
u'a b'

That's much clearer about exactly what you're trying to do, and it should be fast enough; the Unicode database that Python uses makes the cost of checking character attributes fairly trivial. If your inputs are arriving as str (encoded as UTF-8) and you must produce str output, you just convert to unicode, filter, then convert back:

>>> inp = 'a¤ ☃b'  # Not unicode!
>>> inpuni = inp.decode('utf-8')
>>> outpuni = u''.join(x for x in inpuni if x.isspace() or x.islower())
>>> outp = outpuni.encode('utf-8')
>>> outp
'a b'
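
If you need this in more than one place, the same steps could be wrapped in a small helper (just a sketch; the function name is made up and it assumes Python 2 with UTF-8-encoded str input):

def keep_lower_and_space(s, encoding='utf-8'):
    # Accept either an encoded str or an already-decoded unicode object.
    if isinstance(s, str):
        s = s.decode(encoding)
    # Keep only whitespace and lowercase characters, per the Unicode database.
    filtered = u''.join(c for c in s if c.isspace() or c.islower())
    return filtered.encode(encoding)

>>> keep_lower_and_space('a¤ ☃b')
'a b'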

