
After migrating from the Ruby 1.8 mysql gem to Ruby 1.9 with mysql2, I get strings from the legacy DB that are reported to be UTF-8 but look as if they were encoded in latin1 (or probably carry some kind of double encoding, since a straight force_encoding does not help).

String example:

Ñ„Ñ‹Ð²Ð°Ð¿Ñ€Ð¾Ð»Ð´Ð¶Ñ - just a test string - йцукенгшщзхъ

I want to be able to convert it to

фывапролджэ - just a test string - йцукенгшщзхъ

Can somebody help with the conversion a) in Ruby code, and/or b) in SQL?

As copy-paste may lose some info, here are the bytes of the returned string: [195, 145, 226, 128, 158, 195, 145, 226, 128, 185, 195, 144, 194, 178, 195, 144, 194, 176, 195, 144, 194, 191, 195, 145, 226, 130, 172, 195, 144, 194, 190, 195, 144, 194, 187, 195, 144, 194, 180, 195, 144, 194, 182, 195, 145, 194, 141, 32, 45, 32, 106, 117, 115, 116, 32, 97, 32, 116, 101, 115, 116, 32, 115, 116, 114, 105, 110, 103, 32, 45, 32, 195, 144, 194, 185, 195, 145, 226, 128, 160, 195, 145, 198, 146, 195, 144, 194, 186, 195, 144, 194, 181, 195, 144, 194, 189, 195, 144, 194, 179, 195, 145, 203, 134, 195, 145, 226, 128, 176, 195, 144, 194, 183, 195, 145, 226, 128, 166, 195, 145, 197, 160]
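For reference, the bytes can be reassembled in Ruby to see what mysql2 actually returns (a diagnostic sketch; only the first two characters of the list above are shown):

```ruby
# Reassemble the raw bytes reported by mysql2 and inspect them.
bytes = [195, 145, 226, 128, 158, 195, 145, 226, 128, 185]
s = bytes.pack('C*').force_encoding('UTF-8')
# The string really is valid UTF-8 -- just the UTF-8 encoding of the
# mojibake characters themselves, which is why a plain force_encoding
# on the full string cannot fix it.
s.valid_encoding?  # => true
s                  # => "Ñ„Ñ‹"
```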

  • I've had this in the past. It's easy on Ruby 1.8 to send UTF-8 bytes and have them stored incorrectly as Latin-1. You need to fix the issue on the MySQL side by first converting the column to a blob, and then back to a utf8 string/text column. This makes MySQL reinterpret the data. Commented Apr 19, 2013 at 17:34
  • Thanks for the hint in the proper direction. Please see my answer below (could not fit the content in a comment). Commented Apr 19, 2013 at 17:59

2 Answers


OK, I found an SQL solution for this in How to fix double-encoded UTF8 characters (in an utf-8 table).

CONVERT(CAST(CONVERT(field USING latin1) AS BINARY) USING utf8)

Any takers for Ruby?


4 Comments

If your database content is mangled, you need to de-mangle it there. No amount of fussing in Ruby will fix it.
The benefit of a Ruby conversion is that it can be applied lazily (apparently the approach is idempotent). A database conversion is tricky to time: with any deployment strategy you have old clients looking at new records and/or new clients looking at old ones.
Disregard the idempotency comment; it related to a wrong solution.
You can do an in-place UPDATE if you want to fix the problem permanently. I'd suggest doing that on a test copy of the database before going at your production data. I've had data doubly encoded as UTF-8 before due to a connection misconfiguration. Annoying to fix, but necessary. Depending on the amount of data you need to convert, this could take anywhere from minutes to hours.

You might want to set the encoding of your database connection in config/database.yml; experiment with different settings until you get the desired result.

It could be that your connection is defaulting to latin1 for some reason, with the data then being reinterpreted as UTF-8 internally.
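For example, a mysql2 entry would look something like this (a sketch; the database and user names are placeholders, and the relevant key is encoding):

```
# config/database.yml
production:
  adapter: mysql2
  database: legacy_db
  username: app
  encoding: utf8
```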

1 Comment

I tried all possible connection encoding settings, and I get garbage for all of them. Also, I want the code to use utf8 going forward: all new records are recorded/retrieved correctly; the issue is only with legacy records.
