I'm using SQLAlchemy Core to execute string-based queries. I have set the charset to utf8mb4 in the connection string like this:

"mysql+mysqldb://{user}:{password}@{host}:{port}/{db}?charset=utf8mb4"

For some simple select queries (e.g., select name from users where id=XXX limit 1), when the result set contains Unicode characters (e.g., ', ì, etc.), the query fails with the following error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x9a in position 11: invalid start byte

However, I cannot reproduce the error reliably. The same query runs without errors from a Python shell, but fails in a web request or background job.

I'm using Python 3.8 and SQLAlchemy 1.3.24.

I have also tried passing charset: utf8mb4 explicitly as a connect_args property of create_engine().

The underlying database is MySQL 5.7, and all the Unicode columns have utf8mb4 explicitly set as the character set in the schema. Update: The database is actually AWS RDS Aurora MySQL.

I would appreciate any insight into the error or how to reproduce it reliably.

2 Answers

The MySQL documentation section "Connect-Time Error Handling" describes a problem that occurs when the MySQL 8.0 client library connects to a MySQL 5.7 server with the utf8mb4 charset. The 8.0 client requests the utf8mb4_0900_ai_ci collation, which the 5.7 server does not recognize, so the server silently falls back to the latin1 charset with the latin1_swedish_ci collation. The server then sends latin1 result sets while the client believes it is receiving utf8mb4, which eventually results in a UnicodeDecodeError. As a workaround, you have to run SET NAMES utf8mb4 explicitly after connecting. I created issue mysqlclient#504 to ask that the Python client do that on every connection.
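The decoding failure itself can be reproduced without a database. This is a minimal sketch, assuming the server encoded the text as latin1 (which MySQL treats as cp1252) while the client decodes it as UTF-8. The sample name is made up for illustration and is not taken from the question.

```python
# Hypothetical value: "š" encodes to the single byte 0x9a in cp1252.
name = "Miloš"

# What a server that fell back to latin1 would put on the wire.
wire_bytes = name.encode("cp1252")

# What a client that believes the connection is utf8mb4 does with those bytes.
try:
    wire_bytes.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError as exc:
    decode_failed = True
    print(exc)

# Pure-ASCII values decode fine under either assumption, so only rows that
# contain non-ASCII characters can trigger the error.
ascii_ok = "Milos".encode("cp1252").decode("utf-8")
```

Because ASCII-only rows are unaffected, a mismatch like this only shows up for some rows, which can make it look intermittent.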

To confirm that the charset is incorrect after connecting, check the server's values of character_set_client (the charset that statements are interpreted in), character_set_connection (the charset that statements are converted to), and character_set_results (the charset that result sets are sent as). If they are latin1 even though you requested utf8mb4, this bug may have been triggered.

with con.cursor() as c:
    c.execute("show variables like 'character_set_%'")
    for row in c:
        print(row)

(b'character_set_client', b'latin1')
(b'character_set_connection', b'latin1')
(b'character_set_database', b'latin1')
(b'character_set_filesystem', b'binary')
(b'character_set_results', b'latin1')
(b'character_set_server', b'latin1')
(b'character_set_system', b'utf8')
(b'character_sets_dir', b'/usr/share/mysql/charsets/')
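To make this check repeatable (for example, logging it from the web process or background job where the error occurs), the comparison could be wrapped in a small helper. This is only a sketch: the name mismatched_charsets and the default expected value are assumptions for illustration, and the rows are the ones printed above.

```python
# The three session variables that should match the requested charset.
SESSION_CHARSET_VARS = (
    "character_set_client",
    "character_set_connection",
    "character_set_results",
)


def _text(value):
    """Return str for values the driver may hand back as bytes or str."""
    return value.decode("ascii") if isinstance(value, bytes) else value


def mismatched_charsets(rows, expected="utf8mb4"):
    """Map each session variable whose value differs from `expected` to that value."""
    found = {_text(name): _text(value) for name, value in rows}
    return {
        name: found.get(name)
        for name in SESSION_CHARSET_VARS
        if found.get(name) != expected
    }


# Rows copied from the output above.
rows = [
    (b"character_set_client", b"latin1"),
    (b"character_set_connection", b"latin1"),
    (b"character_set_database", b"latin1"),
    (b"character_set_filesystem", b"binary"),
    (b"character_set_results", b"latin1"),
    (b"character_set_server", b"latin1"),
    (b"character_set_system", b"utf8"),
    (b"character_sets_dir", b"/usr/share/mysql/charsets/"),
]
print(mismatched_charsets(rows))
```

An empty dict means the three session variables agree with the requested charset. character_set_database and character_set_server are deliberately ignored, since they do not control how result sets are encoded for the client.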

I believe a workaround for the issue would be to run the following after connecting:

# explicitly set connection charset to the same as MySQLdb.connect()
con.query("SET NAMES utf8mb4")
con.store_result()
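With SQLAlchemy, as used in the question, the same statement could probably be issued for every pooled connection through a "connect" event listener. This is a sketch rather than a tested fix: force_utf8mb4 and make_engine are hypothetical names, and it assumes the DBAPI connection exposes the standard cursor() interface.

```python
def force_utf8mb4(dbapi_connection, connection_record):
    """Run SET NAMES on each new DBAPI connection the pool opens."""
    cursor = dbapi_connection.cursor()
    try:
        cursor.execute("SET NAMES utf8mb4")
    finally:
        cursor.close()


def make_engine(url):
    """Create an engine whose new connections all run force_utf8mb4."""
    # Imported here so the listener above can be used and tested
    # without SQLAlchemy installed.
    from sqlalchemy import create_engine, event

    engine = create_engine(url)
    event.listen(engine, "connect", force_utf8mb4)
    return engine
```

The listener runs once per physical connection, not once per query, so pooled connections that are reused keep the charset set when they were first opened.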

2 Comments

For my case, only the character_set_database and character_set_server are set to latin1, the rest are correctly set as utf8mb4. Would you still say that this is the root cause?
@ModasserBillah, character_set_server configures the default character set used for CREATE TABLE among other things, so it’s highly recommended to set it to utf8mb4 to prevent other errors. But I don’t think that it would cause a client-side UnicodeDecodeError.

Can you try adding the use_unicode=true parameter to the URL?
