1

I have a table in MySQL (5.7) database, which has collation utf8_unicode_ci, and where I'm inserting some data with Python (3.6).

With some of the strings (for example, '\xCE\xA6') I get "Incorrect string value" error. On the DB side, I can mute this error by turning off the strict mode in MySQL, or changing the field's collation to utf8mb4.

However, such strings are "anomalies", and it is not desirable to change a collation or the sql_mode.

How can I detect in Python 3, that a given string will lead to "incorrect string value" error with MySQL, before inserting into a Table ?

4
  • Try decode() Commented Jun 3, 2018 at 14:34
  • @Solarflare, thanks. String doesn't have decode attribute, however, even if one transforms the given string ('\xCE\xA6' in this case) into a byte array and decodes into utf8, it just works in Python. How does one deduce from there, that MySQL will through an error ? Commented Jun 3, 2018 at 15:36
  • Yes, you need to convert it to a bytearray first. I expected your sequence to be non-unicode (and then python would complain about it and could be used to validate), but it seems to be. So I just tested your 2 bytes on mysql and it works for me; so your problem might be somewhere else (it's oftentimes a connection setting). If you are asking if a specific byte sequence will work in your specific setup (and not if if a specific byte sequence is valid unicode and should work), the only way to be 100% sure is to let MySQL verify it, e.g. insert it into a temporary table and catch an error. Commented Jun 3, 2018 at 17:26
  • @Solarflare, the connection's charset is set to utf8, so that is not what causes the problem, I think. I read somewhere that utf8 for Python does not coincide with the utf8_unicode_ci collation implemented in MySQL. The problem is that I don't know how to spot the difference. But thanks for your reply, I'll try to dig more. Commented Jun 3, 2018 at 17:47

1 Answer 1

2

Where do you get the error message? What operation is being performed?

C3A6 is the UTF-8 (cf MySQL's utf8 or utf8mb4) hex for æ; does it seem likely that that was the desired character?

To handle utf8 (or utf8mb4), you need to determine what the client's encoding. Sounds like UTF-8. So, when connecting to MySQL, tell it that -- use these in the connect call:

charset="utf8", use_unicode=True

If the character is in the python source, you need

# -*- coding: utf-8 -*-

at the beginning of the source.

Also the column you are inserting into needs to be CHARACTER SET utf8 (or utf8mb4).

utf8mb4 is needed for Emoji and some of Chinese; otherwise it is 'equivalent' to utf8.

Do not use decode() or any other conversion functions; that will just make things harder to fix. In this arena, two wrongs does not make a right; it makes a worse mess.

If you have other symptoms of garbled characters, see Trouble with UTF-8 characters; what I see is not what I stored

To discuss further, please provide the connection call, the SQL statement involved, SHOW CREATE TABLE, and anything else involved.

C3A6 is a valid utf8/utf8mb4 character æ, and could be interpreted as valid, though unlikely, latin1 æ. But it is invalid for CHARACTER SET ascii. (I don't know how the error message occurred unless the connection said ascii or some obscure charset.)

Sign up to request clarification or add additional context in comments.

6 Comments

thanks for your comments (+1). I could fix the error after all, and that was a silly mistake on my side. I wrote a Python script to populate my database with some test data (randomly building from some real data). So, it modifies over 40 Tables at once (lot of things are connected), and a certain field in one of the tables had latin collation instead of utf8, this was causing the error.
I'm still wondering though, given this stackoverflow.com/questions/10957238/… question, if the utf8 is understood (slightly) differently in MySQL than it is in Python, for example (of perhaps other languages as well). The answer in the link is from 2012 , is it still so in current versions of MySQL ?
@Hayk - latin1 vs utf8: The table does not matter; the connection does. MySQL is willing to change the encoding on the fly if the connection parameters correctly say what the encoding is in the client.
@Hayk - \xF0\x90\x8D\x83 is in utf8mb4, but not in utf8; the leading F0 is the tip-off. There is no valid utf8 (3-byte max) encoding starting with F0. (I added a new last paragraph.)
thanks for the comments, but I'm not sure I am following you on ``...The table does not matter; the connection does...". Say, if I have a connection from Python to MySQL (by MySQLDB module, for example), with charset = utf8, then I should get an error if I try to insert a utf8 but non latin1 character into a latin1 collation field, shouldn't I ?
|

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.