
Given a System.Text.Encoding instance and a string, how can I determine programmatically if that string can be represented using that encoding?

I am working on a serialization library, and when writing a string, I need to know if the string can be written as-is, or if it needs to be escaped.

I looked into the members of Encoding, but none seems to provide that information. One option might be to create an equivalent Encoding instance with a custom EncoderFallback that records whether it was used, and then attempt to convert the string to bytes with that encoding. This seems a bit hacky and not very efficient, though.
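
For illustration, a minimal sketch of that custom-fallback idea could look like the following (the class name DetectingEncoderFallback is made up here): the fallback emits no replacement characters and merely records that it was invoked, so after a GetBytes call you can tell whether any character failed to map.

using System.Text;

sealed class DetectingEncoderFallback : EncoderFallback
{
    public bool WasUsed { get; private set; }

    // This fallback never produces replacement characters.
    public override int MaxCharCount => 0;

    public override EncoderFallbackBuffer CreateFallbackBuffer() => new Buffer(this);

    private sealed class Buffer : EncoderFallbackBuffer
    {
        private readonly DetectingEncoderFallback owner;

        public Buffer(DetectingEncoderFallback owner) { this.owner = owner; }

        // Called when a single character cannot be mapped by the encoding.
        public override bool Fallback(char charUnknown, int index)
        {
            owner.WasUsed = true;
            return false; // emit nothing in place of the character
        }

        // Called when a surrogate pair cannot be mapped.
        public override bool Fallback(char charUnknownHigh, char charUnknownLow, int index)
        {
            owner.WasUsed = true;
            return false;
        }

        public override char GetNextChar() => '\0';
        public override bool MovePrevious() => false;
        public override int Remaining => 0;
    }
}

// Usage sketch:
// var fallback = new DetectingEncoderFallback();
// var encoding = Encoding.GetEncoding(1252, fallback, DecoderFallback.ReplacementFallback);
// encoding.GetBytes(someString);
// bool representable = !fallback.WasUsed;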

4 Comments
  • Although I understand your question, I don't see the relation to serialization, and I don't get why you need that information. There are plenty of systems out there that serialize strings without it. Commented Jan 18, 2016 at 17:45
  • While not strictly related to serialization, the problem I have is that the output format is intended to be human-readable. Therefore, I want to write text directly if the encoding supports it. Otherwise, the format supports escape characters to encode any code point in ASCII. Commented Jan 18, 2016 at 17:54
  • You can get lists of mappings between other character sets and Unicode such as these here: unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS. (I haven't run across any non-Unicode character sets that have more than one encoding.) Commented Jan 18, 2016 at 18:04
  • There are not so many "weird" Encoding classes. Encoding has an IsSingleByte property that you can check. If it's true, there is a good chance the string will need escaping. Otherwise, the remaining encodings are mostly UTF-xx or Unicode, so they don't need escaping. Commented Jan 18, 2016 at 18:07

3 Answers


I don't really like using exceptions for control flow, but the simplicity of this solution definitely beats creating a custom EncoderFallback:

public static bool CanBeEncoded(int codepage, string s)
{
    try
    {
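        // With ExceptionFallback, GetBytes throws EncoderFallbackException
        // for the first character the code page cannot represent.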
        Encoding.GetEncoding(codepage,
                             EncoderFallback.ExceptionFallback,
                             DecoderFallback.ExceptionFallback).GetBytes(s);
        return true;
    }
    catch (EncoderFallbackException)
    {
        return false;
    }
}

Usage:

Console.WriteLine(CanBeEncoded(1252, "Grüß Gott!")); // Prints True
Console.WriteLine(CanBeEncoded(1252, "Привет"));     // Prints False



I solved this by encoding the string, decoding it, and then comparing it with the original. This seems terribly inefficient though.

Encoding targetEncoding = Encoding.GetEncoding(28595);
var text = "Гранит";

var encodedBytes = targetEncoding.GetBytes(text);        // unmappable characters are replaced here
var decodedText = targetEncoding.GetString(encodedBytes);

// If the round trip changed anything, the original string was not representable.
var textCanBeRepresentedByTargetEncoding = decodedText.Equals(text);

1 Comment

If you're after performance, I would definitely use the fact that the UTF-xx and Unicode encodings are fine and all single-byte ones are not (that covers all the encodings actually defined in .NET), and keep this round-trip algorithm as a last resort for encodings (or derived classes) that don't fit exactly into those two categories; see the sketch below.
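
A rough sketch of that hybrid approach, using the round-trip comparison above as the slow path (the helper name CanRepresent is made up here):

using System.Text;

static bool CanRepresent(Encoding encoding, string text)
{
    // Fast path: the Unicode encodings can represent any string...
    if (encoding is UTF8Encoding || encoding is UnicodeEncoding || encoding is UTF32Encoding)
        return true;

    // ...and single-byte code pages are assumed (pessimistically) to need escaping.
    if (encoding.IsSingleByte)
        return false;

    // Last resort for everything else: round-trip and compare.
    var bytes = encoding.GetBytes(text);
    return encoding.GetString(bytes) == text;
}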

Afaik, a String in C# is always Unicode. In this case you could cycle over every character of the string and check whether its numeric value fits into a certain encoding. E.g. a Unicode character with value 0x1234 will not fit into the ASCII range 0x00-0xFF (0x7F, to be accurate).
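
For the ASCII case, that per-character check could look like this minimal sketch (IsPlainAscii is a made-up helper name; other code pages would need their own valid-range or mapping tables):

static bool IsPlainAscii(string s)
{
    foreach (char c in s)
    {
        if (c > 0x7F) // anything above 7 bits is not plain ASCII
            return false;
    }
    return true;
}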

EDIT
  • ASCII: 7 (8) bits. The "8th-bit" characters are codepage-dependent, meaning the same numeric value will appear as a different character in different codepages. No way to change that, afaik.
  • UTF-7: should be very rare, and according to Wikipedia it's not part of the Unicode standard.
  • UTF-8: 8 bits, identical to ASCII in the first half.
  • UTF-16, UTF-32: 16 resp. 32 bits.
Afaik, the character 0x1234 is the same in UTF-16 and UTF-32, but of course is encoded differently (as a multi-byte sequence) in UTF-8.
Unfortunately I don't know any way to find out whether a given byte 0xAB was meant as ASCII (and in which codepage) or UTF-8. Actually, I doubt that there is a way at all...

5 Comments

  • Sure, but how can I know which values are valid in a given encoding? If it is ASCII, that's easy, but I don't know which weird encoding I will be given.
  • What about ISO-8859-1 and other encodings that I don't even know about? I have no control over which encodings I will receive.
  • "A String in C# is always Unicode": Yes, the language specification states this from the outset. (Don't be afraid of the C# specification; it's easy to read the parts you need.)
  • ISO 8859-1 is one of the code pages of extended ASCII. If you receive ASCII text data without any encoding info, you can INTERPRET it as anything! In fact, you have received a collection of bytes; for correct interpretation, you definitely NEED additional info.
  • That's not what I meant. What I have is a string and a TextWriter. I don't control the encoding of the writer. I have the option to escape non-ASCII characters, but I want to keep the file as human-readable as possible. That's why I want to know whether the string can be represented using the TextWriter's encoding before writing to it (see the sketch below).
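
As a minimal sketch of that last scenario, combining the exception-fallback check with a TextWriter (WriteEscaped is a hypothetical placeholder for the library's own escaping routine, not an existing API):

using System.IO;
using System.Text;

static void WriteValue(TextWriter writer, string value)
{
    // Clone the writer's encoding so its fallback can be swapped for the exception one.
    var encoding = (Encoding)writer.Encoding.Clone();
    encoding.EncoderFallback = EncoderFallback.ExceptionFallback;
    try
    {
        encoding.GetBytes(value); // throws if any character cannot be mapped
        writer.Write(value);      // representable: write verbatim
    }
    catch (EncoderFallbackException)
    {
        WriteEscaped(writer, value); // not representable: escape instead
    }
}

// Placeholder for the library-specific escaping (here: \uXXXX sequences for non-ASCII).
static void WriteEscaped(TextWriter writer, string value)
{
    foreach (char c in value)
        writer.Write(c <= 0x7F ? c.ToString() : "\\u" + ((int)c).ToString("x4"));
}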
