1

String with encoding from MYSQL DB Query comes through as mangled in Java

Using Java, I am connecting to a MYSQL database where I query a table to glean a field from it.

The field has UTF8 encoded text, that is:

Córas Éireann

The above text is what I see when I log in to the MYSQL console and look at that row in the table. The mysql version deployed is: mysql Ver 14.14 Distrib 5.1.52, for unknown-linux-gnu (x86_64) using readline 5.1

If I use a python program to connect to the same database and query the same table and get the same row, the text looks correct, i.e. it comes as Córas Éireann

However, when I query it through Java, the text comes out as:

Córas Éireann

which I suspect is Western (ISO latin1), but I am not sure, just a guess.

I did a show table status and saw that the table I am querying has Collation of utf8_general_ci

The table I am querying does not have any adddtionaly encoding definitions, the field, name, is a varchar(512).

+--------------------+--------------+------+-----+---------+----------------+
| Field              | Type         | Null | Key | Default | Extra          |
+--------------------+--------------+------+-----+---------+----------------+
| id                 | int(11)      | NO   | PRI | NULL    | auto_increment |
| name               | varchar(512) | YES  |     | NULL    |                |

The SQL connect JAR I imported is mysql-connector-java-5.1.36, I also tried v 5.1.34 and 5.0.8 but that made no difference.

This is how I connect to the DB:

    String dbStr =
    String.format("jdbc:mysql://%s:%d/%s?useUnicode=yes&characterEncoding=UTF-8", LOCAL_MYSQL_HOST,
                                       LOCAL_MYSQL_PORT, LOCAL_MYSQL_DB);
    try {
        cdb = DriverManager.getConnection(dbStr, LOCAL_MYSQL_USER, LOCAL_MYSQL_PASS);
        Statement dbStatement = cdb.createStatement();
        String query = String.format("SELECT name FROM customer WHERE id=%d",customerId);
        ResultSet row = dbStatement.executeQuery(query);
        if (row.first()) {
             System.out.println("name is " + row.getString("name");
        }
    } catch (SQLException exc) {
        exc.printStackTrace();
    }

Please note, my original implementation did not contain

?useUnicode=yes&characterEncoding=UTF-8

but adding it did not make it better or worse. I added it as I thought this might be the culprit. I also tried latin1 instead of utf-8 to see if this made a difference but no luck, the result is exactly the same, i.e. it comes out as Córas Éireann.

I even tried things like:

                byte[] rowBytes = row.getBytes("name");
                String utfdecocedStr = new String(rowBytes, "UTF-8");
                System.out.println(utfdecocedStr);

but the output remains as Córas Éireann

In python, I do not do any decoding/encoding, the query and connection are all basic, and I get the correct string. Is there a step I am missing that needs to be done on the DB or Java side to get this to work? In my.conf, I do not have any settings to setup any encoding config.

Python approach:

  import MySQLdb
  cdb = MySQLdb.connect(host=LOCAL_MYSQL_HOST,port=LOCAL_MYSQL_PORT,
                         user=LOCAL_MYSQL_USER,
                         passwd=LOCAL_MYSQL_PASS,db=LOCAL_MYSQL_DB)
  ccursor = self.cdb.cursor()
  query = """SELECT name FROM customer WHERE id='%d' """%(customer_id)
  row = ccursor.execute(query)
  if row:
      customername = ccursor.fetchone()

Thank you...Amro

---------Update 20150811---------------

I ran the following commands on the database and I found the following configuration which probably explains the behavior I am seeing:

show variables like 'character%';
+--------------------------+---------------------------------------------+
| Variable_name            | Value                                       |
+--------------------------+---------------------------------------------+
| character_set_client     | utf8                                        |
| character_set_connection | utf8                                        | 
| character_set_database   | latin1                                      |
| character_set_filesystem | binary                                      |
| character_set_results    | utf8                                        |
| character_set_server     | latin1                                      |
| character_set_system     | utf8                                        |
| character_sets_dir       | /usr/share/percona-xtradb-cluster/charsets/ |
+--------------------------+---------------------------------------------+
8 rows in set (0.00 sec)

mysql> show variables like 'collation%';
+----------------------+-------------------+
| Variable_name        | Value             |
+----------------------+-------------------+
| collation_connection | utf8_general_ci   |
| collation_database   | latin1_swedish_ci |
| collation_server     | latin1_swedish_ci |
+----------------------+-------------------+
3 rows in set (0.00 sec)

By adding the following to my.cnf:

character-set-server = utf8
character-set-filesystem = utf8

Then it the tables evolve to:

mysql> show variables like 'character%';

+--------------------------+---------------------------------------------+
| Variable_name            | Value                                       |
+--------------------------+---------------------------------------------+
| character_set_client     | utf8                                        |
| character_set_connection | utf8                                        |
| character_set_database   | latin1                                      |
| character_set_filesystem | utf8                                        |
| character_set_results    | utf8                                        |
| character_set_server     | utf8                                        |
| character_set_system     | utf8                                        |
| character_sets_dir       | /usr/share/percona-xtradb-cluster/charsets/ |
+--------------------------+---------------------------------------------+


+----------------------+-------------------+
| Variable_name        | Value             |
+----------------------+-------------------+
| collation_connection | utf8_general_ci   |
| collation_database   | latin1_swedish_ci |
| collation_server     | utf8_general_ci   |
+----------------------+-------------------+

So I issued the command on the DB console:

 ALTER DATABASE <dbname> CHARACTER SET utf8;

show variables like 'character%';

+--------------------------+---------------------------------------------+
| Variable_name            | Value                                       |
+--------------------------+---------------------------------------------+
| character_set_client     | utf8                                        |
| character_set_connection | utf8                                        |
| character_set_database   | utf8                                        |
| character_set_filesystem | utf8                                        |
| character_set_results    | utf8                                        |
| character_set_server     | utf8                                        |
| character_set_system     | utf8                                        |
| character_sets_dir       | /usr/share/percona-xtradb-cluster/charsets/ |
+--------------------------+---------------------------------------------+
8 rows in set (0.00 sec)

mysql> show variables like 'collation%';
+----------------------+-----------------+
| Variable_name        | Value           |
+----------------------+-----------------+
| collation_connection | utf8_general_ci |
| collation_database   | utf8_general_ci |
| collation_server     | utf8_general_ci |
+----------------------+-----------------+

Unfortunately this did not resolve the issue still.

Can someone let me know how I can correct this hopefully without purging the database?

5
  • Firstly, you need to be sure that while storing the field, you are using the same UTF-8 format. Is data corrupt in DB itself ? You may try with ISO-8859-1 in new String(rowBytes, "ISO-8859-1"). But where are you getting this as a output Córas Éireann on your server console or on your IDE console ? Commented Aug 11, 2015 at 5:10
  • How are you viewing the output in java? If this is through eclipse, could you try changing the encoding of the eclipse console at Run -> Run Configurations -> Common Commented Aug 11, 2015 at 5:11
  • @UtKarash, the data in the DB is fine, as I pointed out if I view the data by connecting to the mysql console or python, the encoding shows up properly. So the DB data is not corrupt. I am using IntelliJ as my IDE, and its set to UTF-8. The output Córas Éireann is dumped on my server console (terminal window) where I run my java program from and my python program from. Commented Aug 11, 2015 at 5:27
  • You seems to be doing all the right things. May be check this : stackoverflow.com/questions/5405236/… In above link my.cnf is for linux and my.ini for windows. Adding the character set default-character-set=utf8 in these files resolves the issue sometimes. Commented Aug 11, 2015 at 5:48
  • @UTKarash, thanks I will give that a try and let you know. I have to see if I need to restart the DB to force a re-read of my.conf. The DB is deployed on a linux machine. Commented Aug 11, 2015 at 5:58

2 Answers 2

1

First off I realized that the table I was using has encoding of UTF-8, but the source table feeding into it is latin1 which confirmed my suspicion of what I thought the encoding was.

The person who wrote the code to copy the data from one to another did no encoding translation, so I feel by that point the data had corrupted encoding.

I performed various experiments that included connecting to database with

?useUnicode=yes&characterEncoding=UTF-8

I also played with the following on the SQL console:

SET character_set_client=latin1;
SET character_set_connection=latin1;
SET character_set_database=latin1;
SET character_set_results=latin1;

as well as variations of playing with my.cnf and setting:

[mysqld] 
character-set-server = utf8 
character-set-filesystem = utf8

Anyway, none of this helped. So finally to prove that it is not a java issue but its a corrupted data set, I created my own table with UTF8 encoding, stored the names in there and had my program pull the data. It looked correct. So now its a matter of fixing the original table contents.

The only thing I could not explain is how the python program did not flag this as an issue. This is not the first time where I encountered the forgiveness of python over Java's strong typing which is a blessing and a pain in these cases.

==============================

Update 2015/08/19:

When I fixed the source table to UTF8 and stored the data properly, the Java code worked but the python code broke.

To fix it in python I merely added the

self.cdb = MySQLdb.connect(host=LOCAL_MYSQL_HOST,port=LOCAL_MYSQL_PORT,
                         user=LOCAL_MYSQL_USER, 
  passwd=LOCAL_MYSQL_PASS,
  db=LOCAL_MYSQL_DB,use_unicode=True,charset="UTF8")

This resolved it.

Unfortunately, when I tried to work around this earlier with the Java driver it did not work:

dbStr = String.format("jdbc:mysql://%s:%d/%s?useUnicode=yes&characterEncoding=latin1", esConfig.LOCAL_MYSQL_HOST, esConfig.LOCAL_MYSQL_PORT, esConfig.LOCAL_MYSQL_DB);

I am using java 1.7.80 and the jdbc package is mysql-connector-java-5.1.36-bin.jar, the latest I found from Oracle

Sign up to request clarification or add additional context in comments.

Comments

1

You have Mojibake.

  • The bytes you have in the client are correctly encoded in utf8 (good).
  • You connected with SET NAMES latin1 (or set_charset('latin1') or ...), probably by default. (It should have been utf8.)
  • xx The column in the table was declared CHARACTER SET latin1. (Or possibly it was inherited from the table/database.) (It should have been utf8.)
  • The column in the tables may or may not have been CHARACTER SET utf8, but it should have been that.

If you need to fix for the data it takes a "2-step ALTER", something like

ALTER TABLE Tbl MODIFY COLUMN col VARBINARY(...) ...;
ALTER TABLE Tbl MODIFY COLUMN col VARCHAR(...) ... CHARACTER SET utf8 ...;

4 Comments

Thank you for the feedback, I have tried to modify the destination table before but I saw no difference on the Java side. My concern was that the source table was in latin1 and the destination table we copied into is UTF-8. Making both tables UTF-8 yielded the consistent results I wanted.
How did you "copy"? One way works will translate latin1 to utf8; another way mangles the bytes.
Unfotunately I don't have that information on how the copy happened. Another person used java and copied it without any translation. However as we all now, Java defaults to UTF-16 so I'm not sure if that contributed to the issue. It was not detected because using python or SQL console we never saw any corruption
I think Java uses utf16 or ucs2 internally, but presents utf8 externally. Córas Éireann involves latin1, not utf16/ucs2. For further discussion, can you provide SELECT col, HEX(col) FROM tbl WHERE ... to see what was actually stored.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.