Detect encoding of a string in C/C++

Question

Given a string in the form of a pointer to an array of bytes (chars), how can I detect its encoding in C/C++? I am using Visual Studio 2008. I searched, but most of the samples I found are in C#.

Thanks

What are the possible encodings you expect? Is there a small collection of possible ones, or could it be just any?
– Kerrek SB, Commented Sep 23, 2011 at 1:09
What environment are you using? I think there's a library to do this under Linux that is portable to Windows.
– Albert Perrien, Commented Sep 23, 2011 at 1:19
Thanks all. K-ballo, Kerrek: it could be UTF-8, UCS-2/UTF-16, or ANSI. AlbertPerrien: I'm using Windows. By the way, what is the library's name?
– jAckOdE, Commented Sep 23, 2011 at 4:25

MSN · Accepted Answer · 2011-09-23 01:42:01Z

13

Assuming you know the length of the input array, you can make the following guesses:

First, check whether the first few bytes match any well-known Unicode byte order mark (BOM). If they do, you're done!
Next, search for '\0' before the last byte. If you find one, you might be dealing with UTF-16 or UTF-32. If you find multiple consecutive '\0's, it's probably UTF-32.
If any byte is in the range 0x80 to 0xff, it's certainly not ASCII or UTF-7. If you are restricting your input to some variant of Unicode, you can assume it's UTF-8. Otherwise, you have to do some guessing to determine which multi-byte character set it is. That will not be fun.
If none of these checks fires, the input is either ASCII, UTF-7, Base64, or a range of UTF-16 or UTF-32 that just happens not to use the top bit and contains no null characters.
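
The checks above can be sketched in standard C++ roughly as follows. This is a minimal illustration, not a complete detector: the function names (detect_bom, is_valid_utf8, guess_encoding) and the returned labels are hypothetical, and the UTF-8 validation step is an addition that narrows the "assume it's UTF-8" case by checking that high bytes at least form well-structured sequences.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: name of the encoding implied by a leading BOM, or "".
// UTF-32LE is tested before UTF-16LE because FF FE 00 00 also starts with FF FE
// (UTF-16LE text whose first character is U+0000 would be misread; that is rare).
std::string detect_bom(const unsigned char* p, std::size_t n) {
    if (n >= 4 && p[0] == 0xFF && p[1] == 0xFE && p[2] == 0x00 && p[3] == 0x00) return "UTF-32LE";
    if (n >= 4 && p[0] == 0x00 && p[1] == 0x00 && p[2] == 0xFE && p[3] == 0xFF) return "UTF-32BE";
    if (n >= 3 && p[0] == 0xEF && p[1] == 0xBB && p[2] == 0xBF) return "UTF-8";
    if (n >= 2 && p[0] == 0xFF && p[1] == 0xFE) return "UTF-16LE";
    if (n >= 2 && p[0] == 0xFE && p[1] == 0xFF) return "UTF-16BE";
    return "";
}

// Simplified structural check: every lead byte is followed by the right number
// of continuation bytes. It does not reject every overlong form or surrogates.
bool is_valid_utf8(const unsigned char* p, std::size_t n) {
    std::size_t i = 0;
    while (i < n) {
        unsigned char c = p[i];
        std::size_t extra;
        if (c < 0x80) extra = 0;
        else if (c >= 0xC2 && c <= 0xDF) extra = 1;
        else if (c >= 0xE0 && c <= 0xEF) extra = 2;
        else if (c >= 0xF0 && c <= 0xF4) extra = 3;
        else return false;                       // stray continuation or invalid lead byte
        if (extra > n - i - 1) return false;     // sequence truncated at end of input
        for (std::size_t k = 1; k <= extra; ++k)
            if ((p[i + k] & 0xC0) != 0x80) return false;
        i += extra + 1;
    }
    return true;
}

// Best guess only; the labels are illustrative, not standard names.
std::string guess_encoding(const unsigned char* p, std::size_t n) {
    std::string bom = detect_bom(p, n);
    if (!bom.empty()) return bom;

    bool has_null = false, consecutive_nulls = false, has_high = false;
    for (std::size_t i = 0; i < n; ++i) {
        if (p[i] >= 0x80) has_high = true;
        if (p[i] == 0 && i + 1 < n) {            // only nulls before the last byte count
            has_null = true;
            if (i > 0 && p[i - 1] == 0) consecutive_nulls = true;
        }
    }
    if (consecutive_nulls) return "UTF-32";
    if (has_null) return "UTF-16";
    if (has_high) return is_valid_utf8(p, n) ? "UTF-8" : "legacy 8-bit";
    return "ASCII-compatible";                   // ASCII, UTF-7, Base64, ...
}
```

For example, the four bytes 'H', 0, 'i', 0 yield "UTF-16", while 'c', 'a', 'f', 0xE9 yield "legacy 8-bit", at which point a code-page guess would still be needed.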


russw_uk · Accepted Answer · 2011-09-23 01:49:01Z

5

It's not an easy problem to solve, and generally relies on heuristics to take a best guess at what the input encoding is, which can be tripped up by relatively innocuous inputs - for example, take a look at this Wikipedia article and The Notepad file encoding Redux for more details.

If you're looking for a Windows-only solution with minimal dependencies, you can look at using a combination of IsTextUnicode and MLang's DetectInputCodePage to attempt character set detection.

If you are looking for portability, but don't mind taking on a fairly large dependency in the form of ICU, then you can use its character set detection routines to achieve the same thing in a portable manner.


Violet Giraffe · Accepted Answer · 2020-09-26 17:51:16Z

3

I have written a small C++ library for detecting text file encoding. It uses Qt, but it can be just as easily implemented using just the standard library.

It operates by measuring symbol occurrence statistics and comparing them to pre-computed reference values for different encodings and languages. As a result, it detects not only the encoding but also the language of the text. The downside is that pre-computed statistics must be provided for each target language in order to detect that language properly.

https://github.com/VioletGiraffe/text-encoding-detector
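
To illustrate the general idea of matching occurrence statistics against reference values, here is a minimal standard-library sketch. It is not the code of the library linked above: the names (Histogram, Profile, byte_frequencies, l1_distance, closest_profile) are hypothetical, it counts raw bytes rather than decoded symbols, and it assumes the caller supplies reference profiles built from sample text in each candidate encoding.

```cpp
#include <array>
#include <cmath>
#include <cstddef>
#include <limits>
#include <string>
#include <vector>

using Histogram = std::array<double, 256>;

// Relative frequency of each byte value in the input (all zeros for empty input).
Histogram byte_frequencies(const std::string& bytes) {
    Histogram h{};
    for (unsigned char c : bytes) h[c] += 1.0;
    if (!bytes.empty())
        for (double& v : h) v /= static_cast<double>(bytes.size());
    return h;
}

// A reference profile: a label such as "ru/windows-1251" plus its expected
// byte frequencies. How the profiles are produced is up to the caller.
struct Profile {
    std::string label;
    Histogram reference;
};

// Sum of absolute differences between two histograms (0 = identical, 2 = disjoint).
double l1_distance(const Histogram& a, const Histogram& b) {
    double d = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) d += std::fabs(a[i] - b[i]);
    return d;
}

// Returns the label of the closest profile, or "" if no profiles were given.
std::string closest_profile(const std::string& bytes,
                            const std::vector<Profile>& profiles) {
    Histogram h = byte_frequencies(bytes);
    std::string best;
    double best_distance = std::numeric_limits<double>::infinity();
    for (const Profile& p : profiles) {
        double d = l1_distance(h, p.reference);
        if (d < best_distance) {
            best_distance = d;
            best = p.label;
        }
    }
    return best;
}
```

A real detector would need much larger reference samples and probably a better similarity measure; short inputs give unreliable histograms.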
