How to convert array of UCS-2 bytes to UTF-8 string in Ruby?

Question

I have an array of UCS-2LE encoded bytes in Ruby. I'm a complete beginner with Ruby and I'm struggling to convert it to a UTF-8 string. I have the same code working just fine in PHP and Java.

In PHP I'm using the iconv library, but in Ruby iconv has been deprecated:

$str = iconv('UCS-2LE', 'UTF-8//IGNORE', implode($byte_array));

In Java I'm using:

str = new String(byte_array, "UTF-16LE");

The bytes in the array are encoded as 2 bytes per character. How can I perform a similar conversion in Ruby? I've tried a few solutions, but none of them worked for me. Thank you.

byte_array.pack("C*").force_encoding("UTF-16LE").encode("UTF-8") should work
– Stefan, Commented Jul 23, 2014 at 9:53
@Stefan it works just fine. I had been building the array by adding each item with .chr; I removed .chr, added your code, and it works. One thing I don't understand: how does it work with C* when the documentation states that C is a char (and not a wide char)?
– user205036, Commented Jul 23, 2014 at 14:47
C interprets an integer value as a 1-byte char, i.e. [65].pack("C") converts 65 (0x41) to "A" ("\x41"). The result is a string with ASCII-8BIT encoding. force_encoding then reinterprets the bytes.
– Stefan, Commented Jul 23, 2014 at 15:25

Stefan · Accepted Answer · 2020-05-06 12:56:26Z

Assuming a byte array:

byte_array = [70, 0, 111, 0, 111, 0]

You can use Array#pack to convert the integer values to characters (C treats each integer as an unsigned char):

string = byte_array.pack("C*")       #=> "F\x00o\x00o\x00"

pack returns a string with ASCII-8BIT encoding:

string.encoding                      #=> #<Encoding:ASCII-8BIT>

You can now use String#force_encoding to reinterpret the bytes as a UTF-16LE string (this changes the string's encoding label in place):

string.force_encoding("UTF-16LE")    #=> "Foo"

The bytes haven't changed so far:

string.bytes                         #=> [70, 0, 111, 0, 111, 0]

To transcode the string into another encoding, use String#encode:

utf8_string = string.encode("UTF-8") #=> "Foo"
utf8_string.bytes                    #=> [70, 111, 111]
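With ASCII-only input, transcoding just appears to drop the zero bytes, which can hide what encode actually does. As a small sketch, here is the same sequence with a non-ASCII character (the euro sign, picked here only for illustration; the variable names are not from the question):

```ruby
# UTF-16LE bytes for U+20AC (euro sign), an illustrative input
euro_bytes = [0xAC, 0x20]

utf16 = euro_bytes.pack("C*").force_encoding("UTF-16LE")
utf16.bytes             #=> [172, 32]  (force_encoding only relabels the bytes)
utf16.valid_encoding?   #=> true

utf8 = utf16.encode("UTF-8")
utf8.bytes              #=> [226, 130, 172]  (encode rewrote the bytes)
utf8 == "\u20AC"        #=> true
```

The two-byte UTF-16LE sequence becomes a three-byte UTF-8 sequence, so encode is a real conversion, while force_encoding only changes how the existing bytes are interpreted.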

The whole conversion can be written in a single line:

byte_array.pack("C*").force_encoding("UTF-16LE").encode("UTF-8")

or by passing the source encoding as a 2nd argument to encode:

byte_array.pack("C*").encode("UTF-8", "UTF-16LE")
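The PHP snippet in the question uses //IGNORE to skip input that can't be converted, whereas String#encode raises an error (such as Encoding::InvalidByteSequenceError) on malformed input by default. If similar leniency is wanted, one possible approach is to pass the invalid:, undef: and replace: options to encode. The helper name below is hypothetical, and the result may not match iconv's behaviour exactly:

```ruby
# Hypothetical helper (not part of the original answer): converts an array of
# UTF-16LE byte values to a UTF-8 string, dropping input that can't be converted.
def utf16le_bytes_to_utf8(byte_array)
  byte_array.pack("C*").encode(
    "UTF-8", "UTF-16LE",
    invalid: :replace,  # malformed or incomplete input, e.g. a dangling odd byte
    undef:   :replace,  # characters with no mapping in the target encoding
    replace: ""         # replace with an empty string, i.e. drop them
  )
end

utf16le_bytes_to_utf8([70, 0, 111, 0, 111, 0])      #=> "Foo"
utf16le_bytes_to_utf8([70, 0, 111, 0, 111, 0, 33])  #=> "Foo" (the dangling byte is dropped)
```

Silently dropping bytes can hide corrupted data, so it may be preferable to leave the options out and rescue the exception when the input is expected to be well-formed.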
