I have some strings of roughly 100 characters each, and I need to detect whether each string contains a Unicode character. The final purpose is to check whether some particular emojis are present, but initially I just want a filter that catches all emojis (and potentially other special characters as well). This method should be fast.

I've seen Python regex matching Unicode properties, but I cannot use any custom packages. I'm using Python 2.7. Thanks!

  • All characters are Unicode characters. The simple test would be if string:; just test for non-empty strings. Any character Python can put in a string is part of the Unicode standard. Commented Sep 19, 2016 at 18:25
  • Perhaps you meant to test for non-ASCII codepoints or something similar? Commented Sep 19, 2016 at 18:26
  • Are you just checking for emojis? Technically, all the ASCII characters are present in Unicode as well, so you need to be a little more specific when you say you're "checking for unicode characters". Commented Sep 19, 2016 at 18:27
  • I would highly recommend reading this primer on Unicode: joelonsoftware.com/articles/Unicode.html Commented Sep 19, 2016 at 18:27
  • @sln: not quite. This post looks like a dupe of "Is there a specific range of unicode code points which can be checked for emojis?" at this point. Commented Sep 19, 2016 at 18:28

1 Answer

There is no point in testing 'if a string contains Unicode characters', because all characters in a string are Unicode characters. The Unicode standard encompasses all codepoints that Python supports, including the ASCII range (Unicode codepoints U+0000 through to U+007F).
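If what you actually want is the rough filter suggested in the comments (anything outside the ASCII range), a minimal sketch could look like the following. The helper name has_non_ascii is hypothetical, and this flags every non-ASCII codepoint, not just Emoji:

```python
def has_non_ascii(text):
    """Return True if any codepoint in `text` lies above U+007F."""
    return any(ord(ch) > 0x7F for ch in text)

print(has_non_ascii(u'plain ASCII text'))  # False
print(has_non_ascii(u'caf\u00e9'))         # True
```

The generator stops at the first match, so for strings of about 100 characters this should be cheap.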

If you want to test for Emoji codepoints, test for specific ranges, as outlined by the Unicode Emoji class specification:

import re

re.compile(
    u'[\u231A-\u231B\u2328\u23CF\u23E9-\u23F3...\U0001F9C0]',
    flags=re.UNICODE)

where you'll have to pick and choose what codepoints you consider to be Emoji. I personally would not include U+0023 NUMBER SIGN in that category for example, but apparently the Unicode standard does.

Note: To be explicit, the above expression is not complete. There are 209 separate entries in the Emoji category and I didn't feel like writing them all out.
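As a rough sketch of how such a pattern might be used, here is a version limited to the first few ranges quoted above. The names EMOJI_RE and contains_emoji are hypothetical, and the character class is deliberately incomplete; you would extend it with whatever entries you decide to treat as Emoji:

```python
import re

# Deliberately incomplete: only the first few BMP entries quoted above.
EMOJI_RE = re.compile(
    u'[\u231A-\u231B\u2328\u23CF\u23E9-\u23F3]',
    flags=re.UNICODE)

def contains_emoji(text):
    """Return True if `text` contains any codepoint listed in EMOJI_RE."""
    return EMOJI_RE.search(text) is not None

print(contains_emoji(u'time is up \u231A'))  # True
print(contains_emoji(u'no emoji here'))      # False
```

Compiling the pattern once at module level and reusing it keeps the per-string cost low.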

Another note: the above uses a \Uhhhhhhhh wide Unicode escape sequence; its use is only supported in a regex pattern in Python 3.3 and up, or in a wide (UCS-4) build for earlier versions of Python. For a narrow Python build, you'll have to match on surrogate pairs for codepoints over U+FFFF.
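To illustrate what matching on surrogate pairs involves, the sketch below computes the UTF-16 surrogate pair for a codepoint above U+FFFF using the standard UTF-16 arithmetic. The function name surrogate_pair is hypothetical; it is only meant to show which two code units a narrow build stores for such a codepoint:

```python
def surrogate_pair(codepoint):
    """Split a codepoint above U+FFFF into its UTF-16 (high, low) surrogates."""
    if not 0x10000 <= codepoint <= 0x10FFFF:
        raise ValueError('codepoint is not in a supplementary plane')
    offset = codepoint - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
    return high, low

high, low = surrogate_pair(0x1F9C0)
print('\\u%04X\\u%04X' % (high, low))  # \uD83E\uDDC0
```

On a narrow build, a pattern would then need to match the two-unit sequence \uD83E\uDDC0 in place of \U0001F9C0.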

7 Comments

Thanks, that makes sense. If I only want a rough version that catches all emojis (and possibly other strings too), would something like return '\\' in mystr work, applied to some encoding of the string that reveals all those backslashes?
I'm in Python 2.7
@pir: there are no backslashes in strings; you can't test for escape sequences because escape sequences are just a way to make it easier to specify a specific codepoint.
Isn't there some way in Python to reveal these escape sequences?
No, because all characters can be expressed with either an escape sequence or the literal value, given the right source code encoding.
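A small illustration of the point made in these comments, assuming nothing beyond the standard library: a character written with an escape sequence is stored as a single codepoint, and the string itself contains no backslash.

```python
escaped = u'\u231A'     # written with a \u escape sequence
named = u'\N{WATCH}'    # the same codepoint, written by its Unicode name

print(len(escaped))         # 1
print(u'\\' in escaped)     # False
print(escaped == named)     # True
```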