
I have a string that is filled with data from another program, and this data may or may not be UTF-8 encoded. If it is not, I can encode it to UTF-8, but what is the best way to detect UTF-8 in C++? I saw this variant https://stackoverflow.com/questions/... but there are comments saying that this solution does not give 100% detection. So if I encode a string that already contains UTF-8 data, I write wrong text to the database.

So can I just use this UTF-8 detection:

bool is_utf8(const char * string)
{
    if(!string)
        return 0;

    const unsigned char * bytes = (const unsigned char *)string;
    while(*bytes)
    {
        if( (// ASCII
             // use bytes[0] <= 0x7F to allow ASCII control characters
                bytes[0] == 0x09 ||
                bytes[0] == 0x0A ||
                bytes[0] == 0x0D ||
                (0x20 <= bytes[0] && bytes[0] <= 0x7E)
            )
        ) {
            bytes += 1;
            continue;
        }

        if( (// non-overlong 2-byte
                (0xC2 <= bytes[0] && bytes[0] <= 0xDF) &&
                (0x80 <= bytes[1] && bytes[1] <= 0xBF)
            )
        ) {
            bytes += 2;
            continue;
        }

        if( (// excluding overlongs
                bytes[0] == 0xE0 &&
                (0xA0 <= bytes[1] && bytes[1] <= 0xBF) &&
                (0x80 <= bytes[2] && bytes[2] <= 0xBF)
            ) ||
            (// straight 3-byte
                ((0xE1 <= bytes[0] && bytes[0] <= 0xEC) ||
                    bytes[0] == 0xEE ||
                    bytes[0] == 0xEF) &&
                (0x80 <= bytes[1] && bytes[1] <= 0xBF) &&
                (0x80 <= bytes[2] && bytes[2] <= 0xBF)
            ) ||
            (// excluding surrogates
                bytes[0] == 0xED &&
                (0x80 <= bytes[1] && bytes[1] <= 0x9F) &&
                (0x80 <= bytes[2] && bytes[2] <= 0xBF)
            )
        ) {
            bytes += 3;
            continue;
        }

        if( (// planes 1-3
                bytes[0] == 0xF0 &&
                (0x90 <= bytes[1] && bytes[1] <= 0xBF) &&
                (0x80 <= bytes[2] && bytes[2] <= 0xBF) &&
                (0x80 <= bytes[3] && bytes[3] <= 0xBF)
            ) ||
            (// planes 4-15
                (0xF1 <= bytes[0] && bytes[0] <= 0xF3) &&
                (0x80 <= bytes[1] && bytes[1] <= 0xBF) &&
                (0x80 <= bytes[2] && bytes[2] <= 0xBF) &&
                (0x80 <= bytes[3] && bytes[3] <= 0xBF)
            ) ||
            (// plane 16
                bytes[0] == 0xF4 &&
                (0x80 <= bytes[1] && bytes[1] <= 0x8F) &&
                (0x80 <= bytes[2] && bytes[2] <= 0xBF) &&
                (0x80 <= bytes[3] && bytes[3] <= 0xBF)
            )
        ) {
            bytes += 4;
            continue;
        }

        return 0;
    }

    return 1;
}

And this code for converting to UTF-8 when the detection returns false:

std::string text = EscReason;
if(!is_utf8(text.c_str()))
{
    // Convert from the system ANSI code page (CP_ACP) to UTF-16 first
    int size = MultiByteToWideChar(CP_ACP, MB_COMPOSITE, text.c_str(),
        text.length(), 0, 0);
    std::wstring utf16_str(size, L'\0');

    MultiByteToWideChar(CP_ACP, MB_COMPOSITE, text.c_str(),
        text.length(), &utf16_str[0], size);

    // Then convert from UTF-16 to UTF-8
    int utf8_size = WideCharToMultiByte(CP_UTF8, 0, utf16_str.c_str(),
        utf16_str.length(), 0, 0, 0, 0);

    std::string utf8_str(utf8_size, '\0');
    WideCharToMultiByte(CP_UTF8, 0, utf16_str.c_str(),
        utf16_str.length(), &utf8_str[0], utf8_size, 0, 0);

    text = utf8_str;
}

Or is the code above not done properly? Also, I am doing this on Windows 7. What about Ubuntu? Will this approach work there?

4 Answers


Comparing whole byte values is not the correct way to detect UTF-8. You have to analyze the actual bit patterns of each byte. UTF-8 uses a very distinct bit pattern that no other encoding uses. Try something more like this instead:

bool is_utf8(const char * string)
{
    if (!string)
        return true;

    const unsigned char * bytes = (const unsigned char *)string;
    int num;

    while (*bytes != 0x00)
    {
        if ((*bytes & 0x80) == 0x00)
        {
            // U+0000 to U+007F 
            num = 1;
        }
        else if ((*bytes & 0xE0) == 0xC0)
        {
            // U+0080 to U+07FF 
            num = 2;
        }
        else if ((*bytes & 0xF0) == 0xE0)
        {
            // U+0800 to U+FFFF 
            num = 3;
        }
        else if ((*bytes & 0xF8) == 0xF0)
        {
            // U+10000 to U+10FFFF 
            num = 4;
        }
        else
            return false;

        bytes += 1;
        for (int i = 1; i < num; ++i)
        {
            if ((*bytes & 0xC0) != 0x80)
                return false;
            bytes += 1;
        }
    }

    return true;
}

Now, this does not take into account illegal UTF-8 sequences, such as overlong encodings, UTF-16 surrogates, and codepoints above U+10FFFF. If you want to make sure the UTF-8 is both valid and correct, you would need something more like this:

bool is_valid_utf8(const char * string)
{
    if (!string)
        return true;

    const unsigned char * bytes = (const unsigned char *)string;
    unsigned int cp;
    int num;

    while (*bytes != 0x00)
    {
        if ((*bytes & 0x80) == 0x00)
        {
            // U+0000 to U+007F 
            cp = (*bytes & 0x7F);
            num = 1;
        }
        else if ((*bytes & 0xE0) == 0xC0)
        {
            // U+0080 to U+07FF 
            cp = (*bytes & 0x1F);
            num = 2;
        }
        else if ((*bytes & 0xF0) == 0xE0)
        {
            // U+0800 to U+FFFF 
            cp = (*bytes & 0x0F);
            num = 3;
        }
        else if ((*bytes & 0xF8) == 0xF0)
        {
            // U+10000 to U+10FFFF 
            cp = (*bytes & 0x07);
            num = 4;
        }
        else
            return false;

        bytes += 1;
        for (int i = 1; i < num; ++i)
        {
            if ((*bytes & 0xC0) != 0x80)
                return false;
            cp = (cp << 6) | (*bytes & 0x3F);
            bytes += 1;
        }

        if ((cp > 0x10FFFF) ||
            ((cp >= 0xD800) && (cp <= 0xDFFF)) ||
            ((cp <= 0x007F) && (num != 1)) ||
            ((cp >= 0x0080) && (cp <= 0x07FF) && (num != 2)) ||
            ((cp >= 0x0800) && (cp <= 0xFFFF) && (num != 3)) ||
            ((cp >= 0x10000) && (cp <= 0x1FFFFF) && (num != 4)))
            return false;
    }

    return true;
}
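
As a quick illustration of the difference, a minimal sketch: the two-byte sequence "\xC0\xAF" is an overlong encoding of U+002F ('/'), so is_utf8 accepts it as structurally well-formed while is_valid_utf8 rejects it.

#include <cstdio>

int main()
{
    // "\xC0\xAF" is an overlong encoding of '/': a valid-looking lead byte
    // followed by a valid continuation byte, but illegal per the UTF-8 spec.
    const char * overlong = "\xC0\xAF";

    std::printf("%d\n", is_utf8(overlong));        // prints 1 (accepted)
    std::printf("%d\n", is_valid_utf8(overlong));  // prints 0 (rejected)
    return 0;
}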

7 Comments

How does (*bytes & 0xE0) == 0xC0 give the range 0x80 to 0x7FF? It should give the range 0xC0 to 0xDF.
@ahmedallam no, what I wrote is correct. Look at the bit pattern table described on Wikipedia for UTF-8. Unicode codepoints U+0080 to U+07FF (not bytes 0xC0 to 0xDF) are encoded in 2 bytes using the bit pattern 110xxxxx 10xxxxxx. 0xE0 is bits 11100000 and 0xC0 is bits 11000000. So, if ((*bytes & 0xE0) == 0xC0) is checking if the high 3 bits of the 1st byte are 110 before (*bytes & 0x1F) grabs the low 5 bits. Then later, ((*bytes & 0xC0) != 0x80) checks if the high 2 bits of the 2nd byte are 10 before (*bytes & 0x3F) grabs the low 6 bits.
@ahmedallam seems you need to brush up on how bits, bit masks, and bitwise operators work.
@RemyLebeau Is this exception/thread safe? (noob question)
@NorbertBoros as long as the string parameter is pointing at a valid C-style null-terminated string, and that memory is not modified or freed by another thread while the function is running, then yes, the function is safe. Otherwise, its behavior is undefined.

You probably don't understand UTF-8 and the alternatives. There are only 256 possible values for a byte. That's not a lot, given the number of characters. As a result, many byte sequences are both valid UTF-8 strings and valid strings in other encodings.

In fact, every ASCII string is intentionally a valid UTF-8 string with essentially the same meaning. Your code would return true for is_utf8("Hello").

Many other non-UTF-8, non-ASCII strings also share byte sequences with valid UTF-8 strings. And there is simply no way to convert a non-UTF-8 string to UTF-8 without knowing exactly which non-UTF-8 encoding it uses. Even Latin-1 and Latin-2 are already quite different. CP_ACP is even worse than Latin-1: it isn't even the same on every system.

Your text must go into the database as UTF-8. Thus, if it isn't yet UTF-8, it must be converted, and you must know the exact source encoding. There is no magical escape.

On Linux, iconv is the usual method to convert between two encodings.
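
A minimal iconv sketch, assuming the source data turns out to be Windows-1252 (substitute whatever encoding your input actually uses):

#include <iconv.h>
#include <string>
#include <stdexcept>

// Convert 'input' from the given source encoding (e.g. "WINDOWS-1252") to UTF-8.
std::string to_utf8(const std::string & input, const char * from_encoding)
{
    iconv_t cd = iconv_open("UTF-8", from_encoding);
    if (cd == (iconv_t)-1)
        throw std::runtime_error("unsupported encoding");

    // A single-byte source character expands to at most a few UTF-8 bytes,
    // so 4x the input size is a safe upper bound.
    std::string output(input.size() * 4, '\0');
    char * in_ptr = const_cast<char *>(input.data());
    size_t in_left = input.size();
    char * out_ptr = &output[0];
    size_t out_left = output.size();

    size_t result = iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left);
    iconv_close(cd);
    if (result == (size_t)-1)
        throw std::runtime_error("conversion failed");

    output.resize(output.size() - out_left);
    return output;
}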

1 Comment

The question does not do the technical breakdown, but I think it is understandable: a "UTF-8 stream", by its grammar, is a subset of a "byte stream" and is independent of which extension of the 7-bit ASCII character set is in use. Only the difference class is detectable.

Simple validation of a null-terminated UTF-8 string (C++20):

#include <cassert>
#include <bit>

constexpr bool validate_utf8(const char* string) noexcept
{
    assert(string != nullptr);

    while (*string)
    {
        switch (std::countl_one(static_cast<unsigned char>(*string)))
        {
            [[unlikely]] case 4: ++string; if (std::countl_one(static_cast<unsigned char>(*string)) != 1) return false; [[fallthrough]];
            [[unlikely]] case 3: ++string; if (std::countl_one(static_cast<unsigned char>(*string)) != 1) return false; [[fallthrough]];
            [[unlikely]] case 2: ++string; if (std::countl_one(static_cast<unsigned char>(*string)) != 1) return false; [[fallthrough]];
              [[likely]] case 0: ++string; break;
            [[unlikely]] default: return false;
        }
    }

    return true;
}
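
Since the function is constexpr, it can be exercised at compile time; a small sketch:

static_assert(validate_utf8("plain ASCII"));   // valid
static_assert(validate_utf8("\xC3\xA9"));      // "é" as a 2-byte sequence
static_assert(!validate_utf8("\xC3"));         // truncated sequence
static_assert(!validate_utf8("\xA9 stray"));   // lone continuation byte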

Another function (C++14) that is about 3 times faster than the previous code, at least on GCC, though the code above is more elegant.

#include <cassert>

constexpr bool validate_utf8(const char* string) noexcept
{
    assert(string != nullptr);

    while (*string)
    {
        if ((*string & 0b10000000) != 0)
        {
            if ((*string & 0b01000000) == 0) return false;
            if ((*string & 0b00100000) != 0)
            {
                if ((*string & 0b00010000) != 0)
                {
                    if ((*string & 0b00001000) != 0)
                        return false;

                    if ((*++string & 0b11000000) != 0b10000000)
                        return false;
                }

                if ((*++string & 0b11000000) != 0b10000000)
                    return false;
            }

            if ((*++string & 0b11000000) != 0b10000000)
                return false;
        }

        ++string;
    }

    return true;
}

1 Comment

std::countl_one has only existed since C++20; this is the first I've heard of it.

This is not the kind of function you want to write yourself. I would suggest you look at using simdjson, which is what I am using for this purpose.

Don't be deterred by the library being called simdjson, that is, by the word "JSON" in its name. It also contains a function for validating UTF-8 strings.

#include "simdjson.h"
#include <cstring>

const char * some_string = "[ 1, 2, 3, 4] ";
size_t length = std::strlen(some_string);
bool is_ok = simdjson::validate_utf8(some_string, length);

You can find a useful blog post about this function here.

If you don't want to import the whole of simdjson into your project, my understanding is that it is possible to use simdutf instead.

I'm not sure if simdjson depends on simdutf, or uses parts of it. It may do.
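
For reference, a minimal sketch using simdutf's validate_utf8 entry point (assuming you build and link simdutf separately):

#include "simdutf.h"
#include <cstring>

int main()
{
    const char * some_string = "valid UTF-8: \xC3\xA9";
    bool is_ok = simdutf::validate_utf8(some_string, std::strlen(some_string));
    return is_ok ? 0 : 1;
}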
