Checking if string contains unicode using standard Python

Question

I have some strings of roughly 100 characters and I need to detect if each string contains an unicode character. The final purpose is to check if some particular emojis are present, but initially I just want a filter that catches all emojis (as well as potentially other special characters). This method should be fast.

I've seen Python regex matching Unicode properties, but I cannot use any custom packages. I'm using Python 2.7. Thanks!

All characters are Unicode characters. The simple test would be if string:; just test for non-empty strings. Any character Python can put in a string is part of the Unicode standard. — Martijn Pieters
– Martijn Pieters, Commented Sep 19, 2016 at 18:25
Perhaps you meant to test for non-ASCII codepoints or something similar? — Martijn Pieters
– Martijn Pieters, Commented Sep 19, 2016 at 18:26
Are you just checking for emoji's? Technically, all the ASCII characters are also present in unicode as well, so you need to be a little more specific when you say you're "checking for unicode characters". — Brendan Abel
– Brendan Abel, Commented Sep 19, 2016 at 18:27
I would highly recommend reading this primer on unicode -- joelonsoftware.com/articles/Unicode.html — Brendan Abel
– Brendan Abel, Commented Sep 19, 2016 at 18:27
@sln: not quite. This post looks like a dupe of Is there a specific range of unicode code points which can be checked for emojis? at this point. — Martijn Pieters
– Martijn Pieters, Commented Sep 19, 2016 at 18:28

Community · Accepted Answer · 2017-05-23 12:32:42Z

1

There is no point is testing 'if a string contains Unicode characters', because all characters in a string are Unicode characters. The Unicode standard encompasses all codepoints that Python supports, including the ASCII range (Unicode codepoints U+0000 through to U+007F).

If you want to test for Emoji code, test for specific ranges, as outlined by the Unicode Emoji class specification:

re.compile(
    u'[\u231A-\u231B\u2328\u23CF\23E9-\u23F3...\U0001F9C0]',
    flags=re.UNICODE)

where you'll have to pick and choose what codepoints you consider to be Emoji. I personally would not include U+0023 NUMBER SIGN in that category for example, but apparently the Unicode standard does.

Note: To be explicit, the above expression is not complete. There are 209 separate entries in the Emoji category and I didn't feel like writing them all out.

Another note: the above uses a \Uhhhhhhhh wide Unicode escape sequence; its use is only supported in a regex pattern in Python 3.3 and up, or in a wide (UCS-4) build for earlier versions of Python. For a narrow Python build, you'll have to match on surrogate pairs for codepoints over U+FFFF.

edited May 23, 2017 at 12:32

CommunityBot

11 silver badge

answered Sep 19, 2016 at 18:37

Martijn Pieters

1.1m326 gold badges4.2k silver badges3.4k bronze badges

Sign up to request clarification or add additional context in comments.

7 Comments

pir Over a year ago

Thanks, that makes sense. Now if I only want to do a rough version that catches all emojis, but also other strings, would return '\' in mystr on some encoding of the string that would reveal all those backslashes then work?

pir Over a year ago

I'm in Python 2.7

Martijn Pieters Over a year ago

@pir: there are no backslashes in strings; you can't test for escape sequences because escape sequences are just a way to make it easier to specify a specific codepoint.

pir Over a year ago

Isn't there some way in Python to reveal these escape sequences?

Martijn Pieters Over a year ago

No, because all characters can be expressed with either an escape sequence or the literal value, given the right source code encoding.

|

Collectives™ on Stack Overflow

Checking if string contains unicode using standard Python

1 Answer 1

7 Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

7 Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related