
After migrating from the Ruby 1.8 mysql gem to Ruby 1.9 with mysql2, I get strings from the legacy DB that are reported to be UTF-8 but look as if they were encoded in latin1 (or probably carry some kind of double encoding, since a straight force_encoding does not help).

String example:

Ñ„Ñ‹Ð²Ð°Ð¿Ñ€Ð¾Ð»Ð´Ð¶Ñ - just a test string - йцукенгшщзхъ

I want to be able to convert it to

фывапролджэ - just a test string - йцукенгшщзхъ

Can somebody help with the conversion a) in Ruby code, and/or b) in SQL?

As copy-paste may lose some info, here are the bytes of the returned string: [195, 145, 226, 128, 158, 195, 145, 226, 128, 185, 195, 144, 194, 178, 195, 144, 194, 176, 195, 144, 194, 191, 195, 145, 226, 130, 172, 195, 144, 194, 190, 195, 144, 194, 187, 195, 144, 194, 180, 195, 144, 194, 182, 195, 145, 194, 141, 32, 45, 32, 106, 117, 115, 116, 32, 97, 32, 116, 101, 115, 116, 32, 115, 116, 114, 105, 110, 103, 32, 45, 32, 195, 144, 194, 185, 195, 145, 226, 128, 160, 195, 145, 198, 146, 195, 144, 194, 186, 195, 144, 194, 181, 195, 144, 194, 189, 195, 144, 194, 179, 195, 145, 203, 134, 195, 145, 226, 128, 176, 195, 144, 194, 183, 195, 145, 226, 128, 166, 195, 145, 197, 160]
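For reference, the bytes can be reassembled in Ruby to see what mysql2 actually returns (a diagnostic sketch; only the first two characters of the list above are shown):

```ruby
# Reassemble the raw bytes reported by mysql2 and inspect them.
bytes = [195, 145, 226, 128, 158, 195, 145, 226, 128, 185]
s = bytes.pack('C*').force_encoding('UTF-8')
# The string really is valid UTF-8 -- just the UTF-8 encoding of the
# mojibake characters themselves, which is why a plain force_encoding
# on the full string cannot fix it.
s.valid_encoding?  # => true
s                  # => "Ñ„Ñ‹"
```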

  • I've had this in the past. It's easy on Ruby 1.8 to send UTF-8 bytes and have them stored incorrectly as Latin-1. You need to fix the issue on the MySQL side by first converting the column to a blob, and then back to a utf8 string/text column. This makes MySQL reinterpret the data. Commented Apr 19, 2013 at 17:34
  • Thanks for the hint in the proper direction. Please see my answer below (could not fit the content in a comment). Commented Apr 19, 2013 at 17:59

2 Answers


OK, I found an SQL solution for this in How to fix double-encoded UTF8 characters (in an utf-8 table).

CONVERT(CAST(CONVERT(field USING latin1) AS BINARY) USING utf8)

Any takers for Ruby?


4 Comments

If your database content is mangled, you need to de-mangle it there. No amount of fussing in Ruby will fix it.
The benefit of a Ruby conversion is that it can be applied lazily (apparently the approach is idempotent). A database conversion is tricky to time: with any deployment strategy you have old clients looking at new records and/or new clients looking at old ones.
Disregard the idempotency comment; it related to a wrong solution.
You can do an in-place UPDATE if you want to fix the problem permanently. I'd suggest doing that on a test copy of the database before going at your production data. I've had data doubly encoded as UTF-8 before due to a connection misconfiguration. Annoying to fix, but necessary. Depending on the amount of data you need to convert, this could take anywhere from minutes to hours.

You might want to set the encoding of your database connection in config/database.yml; experiment with different settings until you get the desired result.

It could be that your connection is defaulting to latin1 for some reason, with the data then being reinterpreted as UTF-8 internally.
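For example, a mysql2 entry would look something like this (a sketch; the database and user names are placeholders, and the relevant key is encoding):

```
# config/database.yml
production:
  adapter: mysql2
  database: legacy_db
  username: app
  encoding: utf8
```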

1 Comment

I tried all possible connection encoding settings, and I get garbage for all of them. Also, I want the code to use utf8 going forward: all new records are recorded/retrieved correctly; the issue is only with legacy records.
