I have an array of UCS-2LE encoded bytes in Ruby, and since I'm a complete beginner with Ruby, I'm struggling to convert it to a UTF-8 string. I have the same code working fine in PHP and Java.

In PHP I'm using the iconv library, but in Ruby iconv has been deprecated:

$str = iconv('UCS-2LE', 'UTF-8//IGNORE', implode($byte_array));

In Java I'm using:

str = new String(byte_array, "UTF-16LE");

Each character in the array is encoded as 2 bytes. How can I perform a similar conversion in Ruby? I've tried a few solutions, but none of them worked for me. Thank you.

  • Did you read stackoverflow.com/questions/1033104/… ? Commented Jul 23, 2014 at 9:38
  • byte_array.pack("C*").force_encoding("UTF-16LE").encode("UTF-8") should work Commented Jul 23, 2014 at 9:53
  • @Stefan it works just fine. I had been building the array by adding items as .chr values; I removed .chr, added your code, and it works. One thing I don't understand: how does it work with C* when the documentation states that C is a char (and not a wide char)? Commented Jul 23, 2014 at 14:47
  • C interprets an integer value as a 1-byte char, i.e. [65].pack("C") converts 65 (0x41) to "A" ("\x41"). The result is a string with ASCII-8BIT encoding. force_encoding then reinterprets the bytes. Commented Jul 23, 2014 at 15:25
  • OK, I get it, thank you :) Commented Jul 23, 2014 at 15:30

1 Answer

Assuming a byte array:

byte_array = [70, 0, 111, 0, 111, 0]

You can use Array#pack to convert the integer values to characters (C treats each integer as an unsigned char):

string = byte_array.pack("C*")       #=> "F\x00o\x00o\x00"

pack returns a string with ASCII-8BIT encoding:

string.encoding                      #=> #<Encoding:ASCII-8BIT>

You can now use String#force_encoding to reinterpret the bytes as a UTF-16LE string:

string.force_encoding("UTF-16LE")    #=> "Foo"

The bytes haven't changed so far:

string.bytes                         #=> [70, 0, 111, 0, 111, 0]

To transcode the string into another encoding, use String#encode:

utf8_string = string.encode("UTF-8") #=> "Foo"
utf8_string.bytes                    #=> [70, 111, 111]
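
With plain ASCII input the transcoding step only appears to drop the zero bytes, so a non-ASCII character makes the difference easier to see. The following is a minimal sketch; the sample bytes are my own illustration and not from the question (233, 0 is "é", U+00E9, in UTF-16LE):

```ruby
# Hypothetical sample input: "é" (U+00E9) followed by "!" in UTF-16LE.
sample_bytes = [233, 0, 33, 0]

# Reinterpret the raw bytes as UTF-16LE; the bytes themselves are untouched.
utf16 = sample_bytes.pack("C*").force_encoding("UTF-16LE")

# Transcode to UTF-8; this produces a new string with different bytes.
utf8 = utf16.encode("UTF-8")

utf16.bytes  #=> [233, 0, 33, 0]
utf8.bytes   #=> [195, 169, 33]
utf8.length  #=> 2
```

Here "é" becomes the two-byte UTF-8 sequence 195, 169, which shows that encode rewrites the bytes rather than just stripping zeros.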

The whole conversion can be written in a single line:

byte_array.pack("C*").force_encoding("UTF-16LE").encode("UTF-8")

or by passing the source encoding as a 2nd argument to encode:

byte_array.pack("C*").encode("UTF-8", "UTF-16LE")
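
If the input may contain malformed data, encode raises an exception by default. The PHP snippet in the question uses //IGNORE, so something roughly similar in spirit can be sketched with the invalid:, undef: and replace: options of String#encode. The helper name below is hypothetical, and silently dropping bad sequences is an assumption about what you want, not a requirement:

```ruby
# Hypothetical helper; the name and the "drop bad input" policy are assumptions.
def ucs2le_bytes_to_utf8(byte_array)
  byte_array.pack("C*").encode(
    "UTF-8", "UTF-16LE",
    invalid: :replace,  # malformed or truncated UTF-16LE sequences
    undef:   :replace,  # characters with no equivalent in the target encoding
    replace: ""         # replace them with nothing, i.e. drop them
  )
end

ucs2le_bytes_to_utf8([70, 0, 111, 0, 111, 0])  #=> "Foo"

# Truncated input (odd number of bytes) no longer raises:
ucs2le_bytes_to_utf8([70, 0, 111])
```

Without these options, the truncated array would raise Encoding::InvalidByteSequenceError, which may be preferable if you would rather detect corrupt input than hide it.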