SQL Query and Unicode Issue

Question

I have a really weird issue with SQL queries on Unicode data. Here's what I've got:

SQL Server Express 2008 R2 AS
Table containing Chinese characters/words/phrases (100,000 rows)

When I run the following, I get the correct row plus 36 other rows, when only the one row should be returned:

SELECT TOP 1000 [ID]
      ,[MyChineseColumn]
      ,UNICODE([MyChineseColumn])
  FROM [dbo].[MyTableName]
  WHERE [MyChineseColumn]= N'㐅'

As you'd expect, the row with 㐅 is returned, but also the following: 〇, 宁, 㮸 and a bunch of others...

Anyone have any ideas what is going on here? This has really got me confused and I am not sure how to solve this one (tried "Googling" already)...

Thanks

I should also mention that most of the other rows query perfectly fine... it's only a handful of "dodgy" ones like the above that I'd really like to figure out the reason for. Maybe it's a certain range of Unicode characters that are doing this? I haven't got a clue...
– Matt, Commented Feb 3, 2011 at 14:40
Since I don't have a font that can display 㐅 or 㮸, they look identical to me. Just for info: the first (and second) 㐅 is U+3405 CJK UNIFIED IDEOGRAPH-3405, while the second one (the last character in the list of wrong results) is U+3BB8 CJK UNIFIED IDEOGRAPH-3BB8.
– Joachim Sauer, Commented Feb 3, 2011 at 14:44
Thank you Martin... I did some research and found the collation I needed to use and it's working now. I'd like to mark this as the answer, but you've only commented. If you add it as an answer, I will mark it "ANSWERED" for you. :-) Thanks!
– Matt, Commented Feb 3, 2011 at 15:02
@Matt - Done! Just out of curiosity, what collation were you using that treated those 4 characters all the same? Even under SQL_Latin1_General_CP1_CI_AI I got 2 rows back for: declare @t TABLE (c nchar(1) collate SQL_Latin1_General_CP1_CI_AI) INSERT INTO @t values (N'㐅'),(N'〇'),(N'宁'),(N'㮸') SELECT DISTINCT c FROM @t
– Martin Smith, Commented Feb 3, 2011 at 15:22

Martin Smith · Accepted Answer · 2011-02-03 15:17:23Z

1

Please check that the column is using an appropriate Chinese collation, as the collation determines the semantics used in this type of comparison.
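
As a minimal sketch of that check, using the table and column names from the question: the first query reads the column's current collation from the catalog, the second lists the Chinese collations the instance offers, and the third overrides the collation for a single comparison. The name Chinese_Simplified_Pinyin_100_CI_AS is only an assumption for illustration; the asker did not say which collation ended up working.

```sql
-- Which collation does the column use right now?
SELECT c.name, c.collation_name
FROM sys.columns AS c
WHERE c.object_id = OBJECT_ID(N'dbo.MyTableName')
  AND c.name = N'MyChineseColumn';

-- Which Chinese collations are available on this instance?
SELECT name, description
FROM sys.fn_helpcollations()
WHERE name LIKE N'Chinese%';

-- Try a candidate collation for this one comparison only.
-- The collation name is an assumption, not the one the asker chose.
SELECT [ID], [MyChineseColumn]
FROM [dbo].[MyTableName]
WHERE [MyChineseColumn] COLLATE Chinese_Simplified_Pinyin_100_CI_AS = N'㐅';
```

If the override returns only the expected row, the same collation can be made permanent on the column with ALTER TABLE ... ALTER COLUMN and a COLLATE clause.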


Lucero · Answer · 2011-02-03 15:45:26Z

0

You may want to try a binary collation. These characters seem to be matched as identical somehow (possibly by ignoring case and/or accents, depending on the collation in use).
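
A sketch of the binary-collation approach, again with the names from the question. Latin1_General_BIN2 is picked here as an assumption; a BIN2 collation compares nvarchar data by code point, so two different code points do not compare as equal.

```sql
-- Compare by code point instead of by linguistic rules.
-- Latin1_General_BIN2 is an illustrative choice, not taken from the thread.
SELECT [ID], [MyChineseColumn], UNICODE([MyChineseColumn]) AS CodePoint
FROM [dbo].[MyTableName]
WHERE [MyChineseColumn] COLLATE Latin1_General_BIN2 = N'㐅';
```

Note that a binary comparison is also case- and accent-sensitive, which may matter if the column holds Latin text as well.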
