2

I know it sounds weird, but look at this:

mysql> select * from tbl_list_charset where word='aê';
+------+
| word |
+------+
| aª  | 
+------+

The data comes from a file of UTF-8 strings, which a Python program reads and inserts into the table. Because the word column is defined as unique, the insertion of one of these two strings fails.

The utf-8 representation of the strings in the file is:

aê = 61 C3 AA
aª = 61 C2 AA
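
These byte sequences can be double-checked outside MySQL by encoding the two words in Python. This is only a minimal sketch in Python 3 syntax (the Python 2.6 mentioned below would need u'...' literals), and the helper name utf8_hex is made up for illustration:

```python
def utf8_hex(text):
    """Return the UTF-8 encoding of text as space-separated uppercase hex bytes."""
    return " ".join("%02X" % byte for byte in text.encode("utf-8"))


# The two words from the question, written with escapes so that the encoding
# of this source file cannot interfere: U+00EA is ê, U+00AA is ª.
word_circumflex = "a\u00ea"
word_ordinal = "a\u00aa"

print(utf8_hex(word_circumflex))  # 61 C3 AA
print(utf8_hex(word_ordinal))     # 61 C2 AA
```

The two words differ only in the second byte, so they are distinct at the byte level.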

My environment: linux, python 2.6.4, mysql 5.0.77 community edition

I am quite sure it is not a bug, but I am clueless about what I am doing wrong...

  • What collation does that column use? That's likely where your problem is located... Commented Jan 20, 2011 at 16:43
  • Probably related: stackoverflow.com/questions/4018950 Commented Jan 20, 2011 at 16:44
  • @Michael Madsen: I knew I was missing something... :) How do I find this out? Commented Jan 20, 2011 at 16:46
  • @davka: SELECT TABLE_COLLATION FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_NAME='[table-name]' should show you the collation on the table. Commented Jan 20, 2011 at 16:50
  • @davka: For the command-line MySQL client, show full columns from table; should do the trick. Most frontends should also provide some way of checking it, and there's always the generic query Archimedix posted. Commented Jan 20, 2011 at 16:53

2 Answers

2

The collation determines which characters compare as "equal". And yes, there are quite a few of these situations. You can try the utf8_bin collation and you won't have this problem, but it will be case sensitive. The bin collations compare strictly, only separating the characters out according to the encoding selected; once that's done, comparisons are done on a binary basis, much like many programming languages would compare strings.

If you need something in between this extreme and your current collation, you can make a custom collation. Or, you might be able to get it "good enough" by storing another column with a different collation on it, and just using each column for specific purposes.
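
To make the difference between the two kinds of comparison concrete, here is a rough Python 3 sketch. It is not MySQL's collation algorithm: binary_equal and folded_equal are hypothetical helpers, and folded_equal only approximates an accent- and case-insensitive collation by stripping combining marks and case-folding.

```python
import unicodedata


def binary_equal(a, b):
    """Compare like a _bin collation: equal only if the encoded bytes match."""
    return a.encode("utf-8") == b.encode("utf-8")


def folded_equal(a, b):
    """Approximate an accent- and case-insensitive comparison.

    Assumption: this is an illustration only, not what MySQL does internally.
    """
    def fold(text):
        # NFD splits e.g. ê into 'e' plus a combining circumflex.
        decomposed = unicodedata.normalize("NFD", text)
        stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
        return stripped.casefold()

    return fold(a) == fold(b)


print(binary_equal("a\u00ea", "ae"))  # False
print(folded_equal("a\u00ea", "ae"))  # True
print(binary_equal("Word", "word"))   # False: binary comparison is case sensitive
print(folded_equal("Word", "word"))   # True
```

The last two lines show the trade-off mentioned above: a binary comparison avoids the accent collisions, but it also stops treating "Word" and "word" as the same value.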

1 Comment

thanks. I see that I need to look into this topic. Case-sensitive is ok for me, so utf8_bin is probably the solution. I need to learn how to configure mysql properly for this.
1

Do you also use UTF-8 with the mysql client program as well as in your Python application?
I.e. call mysql --default-character-set=utf8, and in Python issue at least one SET NAMES 'utf8' before doing any other queries?

2 Comments

Using mysql --default-character-set=utf8 resulted in aê being equal to ae. I don't understand how the change in the client caused different behavior in the server...
I think that you already have a case-insensitive collation on the database or table, e.g. utf8_general_ci, which makes aê equal to ae in comparisons (as e with circumflex is basically an e). I guess, however, that by not using utf8 as the default character set for the client, you would probably be comparing against a differently encoded string, which cannot match even with case-insensitive matching.
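
One way to picture this guess, assuming the connection defaulted to latin1 (nothing in the thread confirms that): each UTF-8 byte sent by the program would be read as a separate latin1 character, so the stored words are no longer the ones in the file. The Python 3 sketch below replays that. misread_as_latin1 and fold are hypothetical helpers, and fold only approximates an accent- and case-insensitive collation such as utf8_general_ci.

```python
import unicodedata


def misread_as_latin1(text):
    """Encode as UTF-8, then decode the same bytes as latin1 (the assumed mix-up)."""
    return text.encode("utf-8").decode("latin-1")


def fold(text):
    """Strip combining accents and case-fold; an approximation, not MySQL's algorithm."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


stored_circumflex = misread_as_latin1("a\u00ea")  # bytes 61 C3 AA become 'aÃª'
stored_ordinal = misread_as_latin1("a\u00aa")     # bytes 61 C2 AA become 'aÂª'

print(stored_circumflex == stored_ordinal)              # False
print(fold(stored_circumflex) == fold(stored_ordinal))  # True
```

Under these assumptions the two misread words differ only in an accented capital A (Ã versus Â), which an accent-insensitive comparison treats as the same letter. That might be why the unique index saw a duplicate.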
