4

I have a MySQL column encoded as utf8. That utf8 is not actually the full utf8 set, but only BMP characters only up to 3 bytes in length. I don't want to try to insert utf8 into MySQL only to find it does not meet MySQL's parameters for what utf8 should be. Is there a way to test in Python if something meets MySQL's parameters before attempting to insert? For obvious reasons, catching exceptions on some_string.encode('utf-8') is not strict enough.

0

2 Answers 2

4
>>> len(u'\uffff'.encode('utf8')) < 4 # Good; fits in utf8
True
>>> len(u'\U00010000'.encode('utf8')) < 4 # Bad; utf8mb4 only
False
>>> ord(u'\uffff') < 65536 # Good; fits in utf8
True
>>> ord(u'\U00010000') < 65536 # Bad; utf8mb4 only
False
Sign up to request clarification or add additional context in comments.

Comments

2

To check whether a string contains a Unicode character above U+FFFF (and which thus can't be stored in a MySQL table using the "utf8" encoding), you can use the following regular expression:

re.match(u"[^\u0000-\uffff]", s)

Alternatively, if you can upgrade to MySQL 5.5 or later, you may want to consider converting your table to the utf8mb4 character set, which can store all Unicode characters.

1 Comment

You should use re.search instead of re.match, because when you try re.match("[^\u0000-\uffff]", "yura🍁") it return None! but re.search("[^\u0000-\uffff]", "yura🍁") find this 🍁 If you want to locate a match anywhere in string, use search() instead.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.