Converting a byte array to a string given encoding

Question

I read from a file to a byte array:

auto text = cast(immutable(ubyte)[]) read("test.txt");

I can get the type of character encoding using the following function:

enum EncodingType {ANSI, UTF8, UTF16LE, UTF16BE, UTF32LE, UTF32BE}

EncodingType DetectEncoding(immutable(ubyte)[] data){
  switch (data[0]){
    case 0xEF:
      if (data[1] == 0xBB && data[2] == 0xBF){
        return EncodingType.UTF8;
      } break;
    case 0xFE:
      if (data[1] == 0xFF){
        return EncodingType.UTF16BE;
      } break;
    case 0xFF:
      if (data[1] == 0xFE){
        if (data[2] == 0x00 && data[3] == 0x00){
          return EncodingType.UTF32LE;
        }else{
          return EncodingType.UTF16LE;
        }
      } break; // without this, control falls through into case 0x00
    case 0x00:
      if (data[1] == 0x00 && data[2] == 0xFE && data[3] == 0xFF){
        return EncodingType.UTF32BE;
      } break;
    default:
      break;
  }
  return EncodingType.ANSI;
}

I need a function that takes a byte array and returns the text as a UTF-8 string. If the text is already encoded in UTF-8, the conversion is trivial. The same is true if it is UTF-16 or UTF-32 in the system's native byte order.

string TextDataToString(immutable(ubyte)[] data){
  import std.utf;
  final switch (DetectEncoding(data[0..4])){
    case EncodingType.ANSI:
      return null;/*???*/
    case EncodingType.UTF8:
      return cast(string) data[3..$];
    case EncodingType.UTF16LE:
      wstring result;
      version(LittleEndian) { result = cast(wstring) data[2..$]; }
      version(BigEndian) { result = "";/*???*/ }
      return toUTF8(result);
    case EncodingType.UTF16BE:
      return null;/*???*/
    case EncodingType.UTF32LE:
      dstring result;
      version(LittleEndian) { result = cast(dstring) data[4..$]; }
      version(BigEndian) { result = "";/*???*/ }
      return toUTF8(result);
    case EncodingType.UTF32BE:
      return null;/*???*/
  }
}

But I could not figure out how to convert a byte array containing ANSI-encoded text (for example, windows-1251) or UTF-16/32 text that is NOT in the native byte order. I marked the relevant places in the code with /*???*/.

The goal is for the following code to work regardless of the text file's encoding:

string s = TextDataToString(text);
writeln(s);

Please help!

user3777262 · Accepted Answer · 2014-06-25 23:01:26Z

4

BOMs are optional. You cannot use them to reliably detect the encoding. Even if there is a BOM, using it to distinguish UTF from code page encodings is problematic, because the byte sequences are usually valid (if nonsensical) in those, too. E.g. 0xFE 0xFF is "юя" in Windows-1251.
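To make the "юя" point concrete, here is a minimal sketch. The helper decodeWindows1251Subset is hypothetical and deliberately incomplete: it relies only on the fact that Windows-1251 maps bytes 0xC0..0xFF to the Cyrillic letters U+0410..U+044F, and it replaces the 0x80..0xBF range with U+FFFD rather than decoding it.

```d
import std.utf : toUTF8;

/// Hypothetical helper: decodes only ASCII and the 0xC0..0xFF range of
/// Windows-1251 (the basic Cyrillic letters). Bytes in 0x80..0xBF are
/// replaced with U+FFFD to keep the sketch short.
string decodeWindows1251Subset(const(ubyte)[] data)
{
    dchar[] result;
    foreach (b; data)
    {
        if (b < 0x80)
            result ~= cast(dchar) b;                     // ASCII maps to itself
        else if (b >= 0xC0)
            result ~= cast(dchar)(0x0410 + (b - 0xC0));  // U+0410 .. U+044F
        else
            result ~= cast(dchar) 0xFFFD;                // not handled in this sketch
    }
    return toUTF8(result);
}

void main()
{
    import std.stdio : writeln;

    // The UTF-16BE BOM is also a valid Windows-1251 byte sequence.
    immutable(ubyte)[] bom = [0xFE, 0xFF];
    writeln(decodeWindows1251Subset(bom)); // prints "юя"
}
```

A file that starts with these two bytes could therefore be UTF-16BE text or Windows-1251 text that happens to begin with "юя"; the bytes alone do not say which.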

Even if you could tell UTF from code page encodings, you couldn't tell the different code pages from one another. You could analyze the whole text and make guesses, but that's super error prone and not very practical.

So, I'd advise you to not try to detect the encoding. Instead, require a specific encoding, or add a mechanism to specify it.
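One way such a mechanism could look is sketched below. The names TextEncoding and decodeText are made up for illustration, and only two encodings are covered (UTF-8, plus Latin-1 because every Latin-1 byte equals its Unicode code point). The point is only that the caller states the encoding and nothing is guessed.

```d
import std.conv : to;
import std.utf : toUTF8, validate;

/// Hypothetical set of supported encodings; extend as needed.
enum TextEncoding { utf8, latin1 }

/// Decodes `data` using the encoding the caller asked for; no guessing.
string decodeText(const(ubyte)[] data, TextEncoding enc)
{
    final switch (enc)
    {
        case TextEncoding.utf8:
            auto s = cast(const(char)[]) data;
            validate(s);            // throws UTFException on malformed input
            return s.idup;
        case TextEncoding.latin1:
            // Every Latin-1 byte equals its Unicode code point.
            dchar[] wide;
            foreach (b; data)
                wide ~= cast(dchar) b;
            return toUTF8(wide);
    }
}

void main()
{
    import std.stdio : writeln;

    // The name could come from a command-line option or a config file.
    auto enc = "latin1".to!TextEncoding;
    immutable(ubyte)[] data = [0x63, 0x61, 0x66, 0xE9]; // "café" in Latin-1
    writeln(decodeText(data, enc)); // prints "café"
}
```

Parsing the name with std.conv.to means an unknown encoding name fails loudly with an exception instead of silently falling back to a default.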

As for transcoding from a different byte order, here is an example for UTF16BE:

import std.algorithm: map;
import std.bitmanip: bigEndianToNative;
import std.conv: to;
import std.exception: enforce;
import std.range: chunks;

alias C = wchar;
enforce(data.length % C.sizeof == 0);
auto result = data
    .chunks(C.sizeof)
    .map!(x => bigEndianToNative!C(x[0 .. C.sizeof]))
    .to!string;
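The same idea extends to UTF-32 and to little-endian input. The following is a sketch, not a tested library routine: transcodeToUtf8 is a hypothetical name, it uses a plain loop instead of the range pipeline above, and it assumes the caller has already sliced off the BOM.

```d
import std.bitmanip : bigEndianToNative, littleEndianToNative;
import std.exception : enforce;
import std.system : Endian;
import std.utf : toUTF8;

/// Hypothetical helper: C is wchar (UTF-16) or dchar (UTF-32).
/// `data` must not include the BOM; slice it off first.
string transcodeToUtf8(C)(const(ubyte)[] data, Endian order)
{
    enforce(data.length % C.sizeof == 0, "truncated code unit");

    C[] units;
    units.reserve(data.length / C.sizeof);
    for (size_t i = 0; i < data.length; i += C.sizeof)
    {
        // Copy one code unit into a fixed-size buffer, then fix its byte order.
        ubyte[C.sizeof] raw;
        raw[] = data[i .. i + C.sizeof];
        units ~= (order == Endian.bigEndian)
            ? bigEndianToNative!C(raw)
            : littleEndianToNative!C(raw);
    }
    return toUTF8(units);
}

void main()
{
    import std.stdio : writeln;

    // UTF-16BE BOM followed by "Hi".
    immutable(ubyte)[] file = [0xFE, 0xFF, 0x00, 0x48, 0x00, 0x69];
    writeln(transcodeToUtf8!wchar(file[2 .. $], Endian.bigEndian)); // prints "Hi"
}
```

Because the conversion functions go to native order, the same code should work on both little-endian and big-endian hosts without version blocks.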

answered Jun 25, 2014 at 23:01

user3777262


2 Comments

Adam D. Ruppe Over a year ago

This module can also do a little ANSI transcoding: dlang.org/phobos/std_encoding.html, though it is pretty minimal. Also, my characterencodings.d, found at github.com/adamdruppe/arsd/blob/master/characterencodings.d, supports more encodings and is fairly simple to use. You give the encoding as a string, for example: string s = convertToUtf8(data_as_bytes, "windows-1251"); The function tryToDetermineEncoding looks for a BOM like the OP does, though I didn't handle big endian! Your code would be a nice addition too.

ApceH Hypocrite Over a year ago

Thanks for the code! Yes, I have also come to the conclusion that it is better to let the caller specify the encoding before reading a file. But expecting UTF-8 by default is not correct either. I need to take at least some steps to determine the encoding.
