
I'm trying to remove everything from a string except lowercase alphanumerics and whitespace.

The problem is when I use a unicode character like this:

re.sub(r'[^a-å_\s]', '', '¤☃')

It doesn't get removed. Why is this, and what can I do about it?

  • What is the expected output from that input? Commented Nov 18, 2016 at 19:38
  • Nothing. If the input was 'a¤b☃c', I'd want the output to be 'abc', since I'm trying to remove everything but lowercase alphanumerics and whitespace. Commented Nov 18, 2016 at 19:40

3 Answers


You can use Unicode.

>>> re.sub(ur'[^a-å_\s]', u'', u'¤☃')
u'\xa4'
>>> print re.sub(ur'[^a-å_\s]', u'', u'¤☃')
¤
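
If the goal is to have ¤ removed as well, one option (just a sketch, assuming Python 2 and a UTF-8 terminal as in the session above) is to list the letters you want explicitly instead of relying on the a-å codepoint range, since that range (U+0061 through U+00E5) happens to contain ¤ at U+00A4:

>>> print re.sub(ur'[^a-zæøå_\s]', u'', u'a¤b☃c')
abc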

1 Comment

When I do this it just returns the same as above. I want it to exclude those characters, but it's not happening.

You can remove any non-ASCII character like this:

>>> import re
>>> 
>>> print re.sub(ur'[^\x00-\x7F]', u'', u'123aąść1b2d3')
123a1b2d3

If you want to preserve some additional non-ASCII characters, just add them to the regexp.

print re.sub(ur'[^\x00-\x7Fæøø]', u'', u'123aąść1b2d3æøø')
123a1b2d3æøø
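
If the goal is to keep only lowercase letters, digits, whitespace and the Danish letters æ, ø, å while also dropping ASCII symbols such as # @ &, a whitelist-style class may be closer to what's wanted (a sketch, assuming unicode input as above):

>>> print re.sub(ur'[^a-z0-9æøå\s]', u'', u'a1#b¤cæøåd☃')
a1bcæøåd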

4 Comments

This works, except I also want the Danish characters 'æ', 'ø', and 'å' to be kept. What can I do to keep those?
This also doesn't get rid of all symbols; stuff like # @ & still gets through.
Because those are ASCII characters. Try something like this: [^\x41-\x7A]. You can find an ASCII table here: asciitable.com
Then just specify what you want to preserve: [^A-z0-9]
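
One caveat about the ranges suggested above: A-z and \x41-\x7A both span codepoints 0x41-0x7A, which also includes the punctuation characters [ \ ] ^ _ ` sitting between 'Z' and 'a', so a tighter whitelist such as [^a-z0-9\s] is usually safer:

>>> print re.sub(ur'[^A-z0-9]', u'', u'ab[c]_d#e@f')
ab[c]_def
>>> print re.sub(ur'[^a-z0-9\s]', u'', u'ab[c]_d#e@f')
abcdef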

Others have already explained that you need a unicode regex with unicode arguments to work with unicode properly. Python is likely storing '¤☃' in an encoded form, often UTF-8, which would store your input as '\xc2\xa4\xe2\x98\x83', and the regex itself would be '[^a-\xc3\xa5_\\s]'. That means your character class is excluding whitespace and ordinals from 97 to 195 (plus explicitly excluding 165, but that's already covered by the range), not ordinals 97 to 229 as you expected. And since the UTF-8 encoded input is represented by bytes in that range (aside from the \xe2 byte, which gets dropped), your output is only lightly filtered.
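
To make the byte-level behaviour concrete, here's roughly what that call does, assuming the source/terminal encoding really is UTF-8 (Python 2):

>>> '¤☃'                            # actually a UTF-8 byte string
'\xc2\xa4\xe2\x98\x83'
>>> re.sub(r'[^a-å_\s]', '', '¤☃')  # byte-wise, the class is [^a-\xc3\xa5_\s]
'\xc2\xa4\x98\x83'

Only \xe2 (226) falls outside the 97-195 byte range, so it is the only byte removed; the rest, including the two bytes that encode ¤, survive.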

Even if you switch to using unicode properly, ord(u'¤') is 164, while ord(u'å') is 229; it correctly preserves ¤ because it's in the character class you've excluded from substitution.
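
For reference, the codepoints involved:

>>> [ord(c) for c in u'a¤å☃']
[97, 164, 229, 9731]

¤ (164) sits inside the a-å range (97-229), while ☃ (9731) does not, which is why only the snowman gets substituted away.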

You shouldn't be using regular expressions here, because it's not practical to exhaustively define all alphabetic and whitespace characters scattered across the Unicode range while excluding all the others. Instead, use the string methods that actually consult the Unicode database to inspect character properties:

>>> u''.join(x for x in u'a¤ ☃b' if x.isspace() or x.islower())
u'a b'

That's much clearer about exactly what you're trying to do, and it should be fast enough; the Unicode database that Python uses makes the cost of checking character attributes fairly trivial. If your inputs are arriving as str (encoded as UTF-8) and you must produce str output, you just convert to unicode, filter, then convert back:

>>> inp = 'a¤ ☃b'  # Not unicode!
>>> inpuni = inp.decode('utf-8')
>>> outpuni = u''.join(x for x in inpuni if x.isspace() or x.islower())
>>> outp = outpuni.encode('utf-8')
>>> outp
'a b'
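
If you need this in more than one place, the same steps could be wrapped in a small helper (just a sketch; the function name is made up and it assumes Python 2 with UTF-8-encoded str input):

def keep_lower_and_space(s, encoding='utf-8'):
    # Accept either an encoded str or an already-decoded unicode object.
    if isinstance(s, str):
        s = s.decode(encoding)
    # Keep only whitespace and lowercase characters, per the Unicode database.
    filtered = u''.join(c for c in s if c.isspace() or c.islower())
    return filtered.encode(encoding)

>>> keep_lower_and_space('a¤ ☃b')
'a b'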

