I read from a file to a byte array:
auto text = cast(immutable(ubyte)[]) read("test.txt");
I can get the type of character encoding using the following function:
enum EncodingType {ANSI, UTF8, UTF16LE, UTF16BE, UTF32LE, UTF32BE}
EncodingType DetectEncoding(immutable(ubyte)[] data){
switch (data[0]){
case 0xEF:
if (data[1] == 0xBB && data[2] == 0xBF){
return EncodingType.UTF8;
} break;
case 0xFE:
if (data[1] == 0xFF){
return EncodingType.UTF16BE;
} break;
case 0xFF:
if (data[1] == 0xFE){
if (data[2] == 0x00 && data[3] == 0x00){
return EncodingType.UTF32LE;
}else{
return EncodingType.UTF16LE;
}
}
case 0x00:
if (data[1] == 0x00 && data[2] == 0xFE && data[3] == 0xFF){
return EncodingType.UTF32BE;
}
default:
break;
}
return EncodingType.ANSI;
}
I need a function that takes a byte array and returns the text string (utf-8). If the text is encoded in UTF-8, then the transformation is trivial. Similarly, if the encoding is UTF-16 or UTF-32 native byte order for the system.
string TextDataToString(immutable(ubyte)[] data){
import std.utf;
final switch (DetectEncoding(data[0..4])){
case EncodingType.ANSI:
return null;/*???*/
case EncodingType.UTF8:
return cast(string) data[3..$];
case EncodingType.UTF16LE:
wstring result;
version(LittleEndian) { result = cast(wstring) data[2..$]; }
version(BigEndian) { result = "";/*???*/ }
return toUTF8(result);
case EncodingType.UTF16BE:
return null;/*???*/
case EncodingType.UTF32LE:
dstring result;
version(LittleEndian) { result = cast(dstring) data[4..$]; }
version(BigEndian) { result = "";/*???*/ }
return toUTF8(result);
case EncodingType.UTF32BE:
return null;/*???*/
}
}
But I could not figure out how to convert byte array with ANSI encoded text (for example, windows-1251) or UTF-16/32 with NOT native byte order.
I ticked the appropriate places in the code with /*???*/.
As a result, the following code should work, with any encoding of a text file:
string s = TextDataToString(text);
writeln(s);
Please help!