
Given a System.Text.Encoding instance and a string, how can I determine programmatically if that string can be represented using that encoding?

I am working on a serialization library, and when writing a string, I need to know if the string can be written as-is, or if it needs to be escaped.

I looked into the members of Encoding, but none seems to provide that information. One option might be to create an equivalent Encoding instance with a custom EncoderFallback that records whether it was used, and then attempt to convert the string to bytes with that encoding. This seems a bit hacky and not very efficient, though.
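
For illustration, a minimal sketch of that custom-fallback idea could look like the following (the class name DetectingEncoderFallback is made up here): the fallback emits no replacement characters and merely records that it was invoked, so after a GetBytes call you can tell whether any character failed to map.

using System.Text;

sealed class DetectingEncoderFallback : EncoderFallback
{
    public bool WasUsed { get; private set; }

    // This fallback never produces replacement characters.
    public override int MaxCharCount => 0;

    public override EncoderFallbackBuffer CreateFallbackBuffer() => new Buffer(this);

    private sealed class Buffer : EncoderFallbackBuffer
    {
        private readonly DetectingEncoderFallback owner;

        public Buffer(DetectingEncoderFallback owner) { this.owner = owner; }

        // Called when a single character cannot be mapped by the encoding.
        public override bool Fallback(char charUnknown, int index)
        {
            owner.WasUsed = true;
            return false; // emit nothing in place of the character
        }

        // Called when a surrogate pair cannot be mapped.
        public override bool Fallback(char charUnknownHigh, char charUnknownLow, int index)
        {
            owner.WasUsed = true;
            return false;
        }

        public override char GetNextChar() => '\0';
        public override bool MovePrevious() => false;
        public override int Remaining => 0;
    }
}

// Usage sketch:
// var fallback = new DetectingEncoderFallback();
// var encoding = Encoding.GetEncoding(1252, fallback, DecoderFallback.ReplacementFallback);
// encoding.GetBytes(someString);
// bool representable = !fallback.WasUsed;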

4 Comments
  • Although I understand your question, I don't see the relation to serialization, and I don't get why you need that information. There are plenty of systems out there that serialize strings without it. Commented Jan 18, 2016 at 17:45
  • While not strictly related to serialization, the problem I have is that the output format is intended to be human-readable. Therefore, I want to write text directly if the encoding supports it. Otherwise, the format supports escape characters to encode any code point in ASCII. Commented Jan 18, 2016 at 17:54
  • You can get lists of mappings between other character sets and Unicode such as these here: unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS. (I haven't run across any non-Unicode character sets that have more than one encoding.) Commented Jan 18, 2016 at 18:04
  • There are not so many "weird" Encoding classes. Encoding has an IsSingleByte property that you can check. If it's true, there is a good chance the string will need escaping. Otherwise, the remaining encodings are mostly UTF-xx or Unicode, so they don't need escaping. Commented Jan 18, 2016 at 18:07

3 Answers


I don't really like using exceptions for control flow, but the simplicity of this solution definitely beats creating a custom EncoderFallback:

public static bool CanBeEncoded(int codepage, string s)
{
    try
    {
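        // With ExceptionFallback, GetBytes throws EncoderFallbackException
        // for the first character the code page cannot represent.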
        Encoding.GetEncoding(codepage,
                             EncoderFallback.ExceptionFallback,
                             DecoderFallback.ExceptionFallback).GetBytes(s);
        return true;
    }
    catch (EncoderFallbackException)
    {
        return false;
    }
}

Usage:

Console.WriteLine(CanBeEncoded(1252, "Grüß Gott!")); // Prints True
Console.WriteLine(CanBeEncoded(1252, "Привет"));     // Prints False



I solved this by encoding the string, decoding it, and then comparing it with the original. This seems terribly inefficient though.

Encoding targetEncoding = Encoding.GetEncoding(28595);
var text = "Гранит";

var encodedBytes = targetEncoding.GetBytes(text);        // unmappable characters are replaced here
var decodedText = targetEncoding.GetString(encodedBytes);

// If the round trip changed anything, the original string was not representable.
var textCanBeRepresentedByTargetEncoding = decodedText.Equals(text);

1 Comment

If you're after performance, I would definitely use the fact that the UTF-xx and Unicode encodings are fine and all single-byte ones are not (that covers all the encodings actually defined in .NET), and keep this round-trip algorithm as a last resort for encodings (or derived classes) that don't fit exactly into those two categories; see the sketch below.
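
A rough sketch of that hybrid approach, using the round-trip comparison above as the slow path (the helper name CanRepresent is made up here):

using System.Text;

static bool CanRepresent(Encoding encoding, string text)
{
    // Fast path: the Unicode encodings can represent any string...
    if (encoding is UTF8Encoding || encoding is UnicodeEncoding || encoding is UTF32Encoding)
        return true;

    // ...and single-byte code pages are assumed (pessimistically) to need escaping.
    if (encoding.IsSingleByte)
        return false;

    // Last resort for everything else: round-trip and compare.
    var bytes = encoding.GetBytes(text);
    return encoding.GetString(bytes) == text;
}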

Afaik, a String in C# is always Unicode. In this case you could cycle over every character of the string and check whether its numeric value fits into a certain encoding. E.g. a Unicode character with value 0x1234 will not fit into the ASCII range 0x00-0xFF (0x7F, to be accurate).
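
For the ASCII case, that per-character check could look like this minimal sketch (IsPlainAscii is a made-up helper name; other code pages would need their own valid-range or mapping tables):

static bool IsPlainAscii(string s)
{
    foreach (char c in s)
    {
        if (c > 0x7F) // anything above 7 bits is not plain ASCII
            return false;
    }
    return true;
}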

EDIT
  • ASCII: 7 (8) bits. The "8th-bit" characters are codepage-dependent, meaning the same numeric value will appear as a different character in different codepages. No way to change that, afaik.
  • UTF-7: should be very rare, and according to Wikipedia it's not part of the Unicode standard.
  • UTF-8: 8 bits, identical to ASCII in the first half.
  • UTF-16, UTF-32: 16 resp. 32 bits.
Afaik, the character 0x1234 is the same in UTF-16 and UTF-32, but of course is encoded differently (as a multi-byte sequence) in UTF-8.
Unfortunately I don't know any way to find out whether a given byte 0xAB was meant as ASCII (and in which codepage) or UTF-8. Actually, I doubt that there is a way at all...

5 Comments

  • Sure, but how can I know which values are valid in a given encoding? If it is ASCII, that's easy, but I don't know which weird encoding I will be given.
  • What about ISO-8859-1 and other encodings that I don't even know about? I have no control over which encodings I will receive.
  • "A String in C# is always Unicode": Yes, the language specification states this from the outset. (Don't be afraid of the C# specification; it's easy to read the parts you need.)
  • ISO 8859-1 is one of the code pages of extended ASCII. If you receive ASCII text data without any encoding info, you can INTERPRET it as anything! In fact, you have received a collection of bytes; for correct interpretation, you definitely NEED additional info.
  • That's not what I meant. What I have is a string and a TextWriter. I don't control the encoding of the writer. I have the option to escape non-ASCII characters, but I want to keep the file as human-readable as possible. That's why I want to know whether the string can be represented using the TextWriter's encoding before writing to it (see the sketch below).
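
As a minimal sketch of that last scenario, combining the exception-fallback check with a TextWriter (WriteEscaped is a hypothetical placeholder for the library's own escaping routine, not an existing API):

using System.IO;
using System.Text;

static void WriteValue(TextWriter writer, string value)
{
    // Clone the writer's encoding so its fallback can be swapped for the exception one.
    var encoding = (Encoding)writer.Encoding.Clone();
    encoding.EncoderFallback = EncoderFallback.ExceptionFallback;
    try
    {
        encoding.GetBytes(value); // throws if any character cannot be mapped
        writer.Write(value);      // representable: write verbatim
    }
    catch (EncoderFallbackException)
    {
        WriteEscaped(writer, value); // not representable: escape instead
    }
}

// Placeholder for the library-specific escaping (here: \uXXXX sequences for non-ASCII).
static void WriteEscaped(TextWriter writer, string value)
{
    foreach (char c in value)
        writer.Write(c <= 0x7F ? c.ToString() : "\\u" + ((int)c).ToString("x4"));
}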
