
I want to read a text file that includes information about its encoding in its content. I don't know which encoding is used before I read the file. I use System.IO.File.ReadAllText to read the file. How can I convert the encoding without reading the file again?

I tried specifying a default encoding while reading the file and then converting the result to the final encoding, but it doesn't convert correctly:

string input = File.ReadAllText(filePath, Encoding.Default);
Encoding encoding = GetEncodingFromInput(input);
input = encoding.GetString(Encoding.Convert(Encoding.Default, encoding, Encoding.Default.GetBytes(input)));

The converted string doesn't contain the same characters as when the file is read with the correct encoding; some characters are changed to question marks.

  • Don't. You can't recover text that was lost due to a wrong encoding. Use the correct encoding from the beginning, or don't specify one. ReadAllText will try to detect whether the file is UTF8/UTF16 and fall back to Default, i.e. the system's locale, if it can't. Commented Sep 13, 2019 at 9:26
  • @Governor Note: Encoding.Default doesn't do what you seem to think it does... Encoding.Default is in fact specifying ANSI encoding for the current code page, which is a legacy encoding. Commented Sep 13, 2019 at 9:27
  • @MatthewWatson Not true; this article states that Encoding.Default is "The default encoding for this .NET implementation" but also states that "Different computers can use different encodings as the default, and the default encoding can change on a single computer" Commented Sep 13, 2019 at 9:30
  • I recommend you just read the file as Encoding.Unicode. 1st of all it is the C# standard (source for this claim), and 2nd of all it is backwards compatible with ASCII and ANSI. So even if the file is encoded as ASCII or ANSI you will still read the right letters with Unicode. Commented Sep 13, 2019 at 9:32
  • @MindSwipe I'm assuming that this is .NET Framework rather than .NET Core, in which case it will be an ANSI encoding. (.NET Core will always use UTF8.) Commented Sep 13, 2019 at 9:34

3 Answers


I don't know what encoding is used before I read the file.

Files that self-declare their encoding usually have a documented method for finding it - check your file format's published documentation.

If not, here are a few common techniques:

  1. Look for a Unicode BOM in the first few bytes. You can do this by reading the first 5 bytes of the file into a buffer (or a 64-bit integer) and looking them up in a dictionary of known sequences (see the sketch after this list). This is what System.IO.StreamReader does by default.
    • You can see a list of known BOM byte sequences here: https://en.wikipedia.org/wiki/Byte_order_mark
    • Note that UTF-8 does not require a BOM - but many editors (well, mostly Visual Studio) will stick 0xEF 0xBB 0xBF at the beginning anyway.
  2. If it's a text/*-family file format with the encoding declared in some kind of header, you can read the first kilobyte of the file into a buffer, interpret every byte valued at or below 0x7F as a character in an ASCII string, and then use a simple parser (even String.IndexOf) or a Regex to look for your header's delimiter.
    • This technique is often used for HTML files where the HTTP header declaring the encoding isn't available and the program needs to look for <meta http-equiv="Content-Type" /> to get the encoding name.
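
Here's a minimal sketch of both techniques. The class and method names (EncodingSniffer, DetectBomEncoding, SniffDeclaredCharset) and the charset regex are illustrative assumptions, not part of any standard API:

using System.Text;
using System.Text.RegularExpressions;

static class EncodingSniffer // hypothetical helper, for illustration only
{
    // Technique 1: compare the first bytes against the well-known BOM sequences.
    // Returns null when no BOM is present. Order matters: the UTF-32 LE BOM
    // (FF FE 00 00) starts with the UTF-16 LE BOM (FF FE).
    public static Encoding DetectBomEncoding(byte[] head)
    {
        if (head.Length >= 4 && head[0] == 0xFF && head[1] == 0xFE && head[2] == 0x00 && head[3] == 0x00)
            return Encoding.UTF32;                                           // UTF-32 LE
        if (head.Length >= 4 && head[0] == 0x00 && head[1] == 0x00 && head[2] == 0xFE && head[3] == 0xFF)
            return new UTF32Encoding(bigEndian: true, byteOrderMark: true);  // UTF-32 BE
        if (head.Length >= 3 && head[0] == 0xEF && head[1] == 0xBB && head[2] == 0xBF)
            return Encoding.UTF8;                                            // UTF-8 "BOM"
        if (head.Length >= 2 && head[0] == 0xFF && head[1] == 0xFE)
            return Encoding.Unicode;                                         // UTF-16 LE
        if (head.Length >= 2 && head[0] == 0xFE && head[1] == 0xFF)
            return Encoding.BigEndianUnicode;                                // UTF-16 BE
        return null;
    }

    // Technique 2: decode the first kilobyte as ASCII (bytes above 0x7F become
    // '?') and look for a declared charset, e.g. in an HTML meta header.
    public static string SniffDeclaredCharset(byte[] head)
    {
        string probe = Encoding.ASCII.GetString(head);
        Match m = Regex.Match(probe, @"charset\s*=\s*[""']?([A-Za-z0-9_\-]+)",
                              RegexOptions.IgnoreCase);
        return m.Success ? m.Groups[1].Value : null;
    }
}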

I use System.IO.File.ReadAllText for reading the file. How can I convert encoding without reading the file again?

You don't. Only use ReadAllText for simple text/plain files with a consistent and known encoding. For this scenario you'll need to use Stream and StreamReader (and possibly BinaryReader) together.
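
A sketch of one way to combine them, assuming the declaration sits in the first kilobyte and reusing the question's own GetEncodingFromInput helper:

using System.IO;
using System.Text;

using (var fs = File.OpenRead(filePath))
{
    // First pass: read just enough text to find the declared encoding.
    // leaveOpen: true keeps fs usable for the second pass.
    Encoding declared;
    using (var probe = new StreamReader(fs, Encoding.ASCII,
                                        detectEncodingFromByteOrderMarks: false,
                                        bufferSize: 1024, leaveOpen: true))
    {
        var head = new char[1024];
        int n = probe.Read(head, 0, head.Length);
        declared = GetEncodingFromInput(new string(head, 0, n)); // the question's helper
    }

    // Second pass: rewind and decode the whole file with the right encoding;
    // the OS will typically serve this from cache rather than re-reading disk.
    fs.Position = 0;
    using (var reader = new StreamReader(fs, declared))
    {
        string text = reader.ReadToEnd();
    }
}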


Comments

ReadAllText tries to detect the encoding from the BOM and falls back to Default. Even with the other classes, one needs to know the encoding in advance.
It's an SIE format file, and the documentation says it currently permits only IBM Extended 8-bit ASCII encoding, but that can change in the future and I wanted to handle this possibility. I think I will have to use this one and trust that it's always the correct one.
@Governor IBM Extended 8-bit ASCII is the 437 codepage. Use Encoding.GetEncoding(437).
@PanagiotisKanavos: I know that and that's what I've been doing so far.
@Governor in that case the actual question should be how to detect a file's encoding, not how to convert encodings.

Use System.IO.File.ReadAllBytes to read the file, then decode the byte array once you know which encoding you need, using something like System.Text.Encoding.XXXX.GetString().
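
For example, a sketch (437 here stands in for whatever encoding the file's content turns out to declare):

using System.IO;
using System.Text;

byte[] raw = File.ReadAllBytes(filePath);   // one physical read
// ... inspect raw, or a provisional decode of it, to pick the encoding ...
string text = Encoding.GetEncoding(437).GetString(raw);  // decode once known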

Comments

Thanks, but that's what I was trying to avoid.
@Governor you can't. ? are error characters which means the original characters are simply gone.
@Governor I don't know if I got you wrong; it looked like you wanted to avoid physically reading the file twice. You can do a single read using ReadAllBytes, then convert the byte array once, check which encoding you need, and then convert it again from the original byte array which is still in memory. If you don't want to convert the whole thing twice, check the other answer by Dai.

From various comments it appears the text is in the IBM Extended 8-bit ASCII codepage, also known as codepage 437. To load files in that codepage use Encoding.GetEncoding(437), e.g.:

var cp437 = Encoding.GetEncoding(437);
var input = File.ReadAllText(filePath, cp437);
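
Note that on .NET Core and later the legacy code pages require registering the code-pages provider first (from the System.Text.Encoding.CodePages package); on .NET Framework they are available out of the box:

using System.Text;

// Required once, at startup, on .NET Core / modern .NET only
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);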

The ? or � characters are the conversion error replacement characters produced when text is read with the wrong codepage. It's not possible to recover the original text from them.

Encoding.Default is the system's default codepage, not some .NET-wide default. As the docs say:

The Default property in the .NET Framework: In the .NET Framework on the Windows desktop, the Default property always gets the system's active code page and creates an Encoding object that corresponds to it. The active code page may be an ANSI code page, which includes the ASCII character set along with additional characters that vary by code page. Because all Default encodings based on ANSI code pages lose data, consider using the Encoding.UTF8 encoding instead. UTF-8 is often identical in the U+00 to U+7F range, but can encode characters outside the ASCII range without loss.

Finally, both File.ReadAllText and the StreamReader class it uses will try to detect the encoding from the file's BOM (Byte Order Mark) and fall back to UTF8 if no BOM is found.

Detecting codepages

There's no reliable way to detect the encoding, as many codepages map the same bytes to different characters. One can only identify bad matches reliably, because the resulting text will contain the replacement character �.

What one can do is load the file's bytes once and try multiple encodings, eliminating those whose output contains �. Another step would be to check for expected non-English words or characters and eliminate the encodings that don't produce them.

Encoding.GetEncodings() will return all registered encodings. A rough method that finds probable encodings could be:

IEnumerable<Encoding> DetectEncodings(byte[] buffer)
{
    // GetEncodings() returns EncodingInfo objects, so call GetEncoding()
    // on each one to get the actual Encoding instance
    var candidates = from info in Encoding.GetEncodings()
                     let enc = info.GetEncoding()
                     let text = enc.GetString(buffer)
                     where !text.Contains('�')
                     select enc;
    return candidates;
}

Or, using value tuples:

IEnumerable<(Encoding, string)> DetectEncodings(byte[] buffer)
{
    var candidates = from info in Encoding.GetEncodings()
                     let enc = info.GetEncoding()
                     let text = enc.GetString(buffer)
                     where !text.Contains('�')
                     select (enc, text);
    return candidates;
}
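
A usage sketch of the tuple version, printing each surviving candidate with a short preview of its decoding:

using System;
using System.IO;
using System.Text;

byte[] buffer = File.ReadAllBytes(filePath);   // filePath as in the question
foreach (var (enc, text) in DetectEncodings(buffer))
{
    var preview = text.Substring(0, Math.Min(40, text.Length));
    Console.WriteLine($"{enc.WebName}: {preview}");
}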

