
I have a string that is filled with data from another program, and this data may or may not be UTF-8 encoded. If it is not, I can encode it to UTF-8, but what is the best way to detect UTF-8 in C++? I saw this variant https://stackoverflow.com/questions/... but there are comments saying that this solution does not give 100% detection. So if I encode a string that already contains UTF-8 data, I write wrong text to the database.

So can I just use this UTF-8 detection:

bool is_utf8(const char * string)
{
    if(!string)
        return 0;

    const unsigned char * bytes = (const unsigned char *)string;
    while(*bytes)
    {
        if( (// ASCII
             // use bytes[0] <= 0x7F to allow ASCII control characters
                bytes[0] == 0x09 ||
                bytes[0] == 0x0A ||
                bytes[0] == 0x0D ||
                (0x20 <= bytes[0] && bytes[0] <= 0x7E)
            )
        ) {
            bytes += 1;
            continue;
        }

        if( (// non-overlong 2-byte
                (0xC2 <= bytes[0] && bytes[0] <= 0xDF) &&
                (0x80 <= bytes[1] && bytes[1] <= 0xBF)
            )
        ) {
            bytes += 2;
            continue;
        }

        if( (// excluding overlongs
                bytes[0] == 0xE0 &&
                (0xA0 <= bytes[1] && bytes[1] <= 0xBF) &&
                (0x80 <= bytes[2] && bytes[2] <= 0xBF)
            ) ||
            (// straight 3-byte
                ((0xE1 <= bytes[0] && bytes[0] <= 0xEC) ||
                    bytes[0] == 0xEE ||
                    bytes[0] == 0xEF) &&
                (0x80 <= bytes[1] && bytes[1] <= 0xBF) &&
                (0x80 <= bytes[2] && bytes[2] <= 0xBF)
            ) ||
            (// excluding surrogates
                bytes[0] == 0xED &&
                (0x80 <= bytes[1] && bytes[1] <= 0x9F) &&
                (0x80 <= bytes[2] && bytes[2] <= 0xBF)
            )
        ) {
            bytes += 3;
            continue;
        }

        if( (// planes 1-3
                bytes[0] == 0xF0 &&
                (0x90 <= bytes[1] && bytes[1] <= 0xBF) &&
                (0x80 <= bytes[2] && bytes[2] <= 0xBF) &&
                (0x80 <= bytes[3] && bytes[3] <= 0xBF)
            ) ||
            (// planes 4-15
                (0xF1 <= bytes[0] && bytes[0] <= 0xF3) &&
                (0x80 <= bytes[1] && bytes[1] <= 0xBF) &&
                (0x80 <= bytes[2] && bytes[2] <= 0xBF) &&
                (0x80 <= bytes[3] && bytes[3] <= 0xBF)
            ) ||
            (// plane 16
                bytes[0] == 0xF4 &&
                (0x80 <= bytes[1] && bytes[1] <= 0x8F) &&
                (0x80 <= bytes[2] && bytes[2] <= 0xBF) &&
                (0x80 <= bytes[3] && bytes[3] <= 0xBF)
            )
        ) {
            bytes += 4;
            continue;
        }

        return 0;
    }

    return 1;
}

And this code for converting to UTF-8 when the detection returns false:

std::string text = EscReason;
if(!is_utf8(text.c_str()))
{
    // Convert from the system ANSI code page (CP_ACP) to UTF-16 first
    int size = MultiByteToWideChar(CP_ACP, MB_COMPOSITE, text.c_str(),
        text.length(), 0, 0);
    std::wstring utf16_str(size, L'\0');

    MultiByteToWideChar(CP_ACP, MB_COMPOSITE, text.c_str(),
        text.length(), &utf16_str[0], size);

    // Then convert from UTF-16 to UTF-8
    int utf8_size = WideCharToMultiByte(CP_UTF8, 0, utf16_str.c_str(),
        utf16_str.length(), 0, 0, 0, 0);

    std::string utf8_str(utf8_size, '\0');
    WideCharToMultiByte(CP_UTF8, 0, utf16_str.c_str(),
        utf16_str.length(), &utf8_str[0], utf8_size, 0, 0);

    text = utf8_str;
}

Or is the code above not done properly? Also, I am doing this on Windows 7. What about Ubuntu? Will this approach work there?

4 Answers


Comparing whole byte values is not the correct way to detect UTF-8. You have to analyze the actual bit patterns of each byte. UTF-8 uses a very distinct bit pattern that no other encoding uses. Try something more like this instead:

bool is_utf8(const char * string)
{
    if (!string)
        return true;

    const unsigned char * bytes = (const unsigned char *)string;
    int num;

    while (*bytes != 0x00)
    {
        if ((*bytes & 0x80) == 0x00)
        {
            // U+0000 to U+007F 
            num = 1;
        }
        else if ((*bytes & 0xE0) == 0xC0)
        {
            // U+0080 to U+07FF 
            num = 2;
        }
        else if ((*bytes & 0xF0) == 0xE0)
        {
            // U+0800 to U+FFFF 
            num = 3;
        }
        else if ((*bytes & 0xF8) == 0xF0)
        {
            // U+10000 to U+10FFFF 
            num = 4;
        }
        else
            return false;

        bytes += 1;
        for (int i = 1; i < num; ++i)
        {
            if ((*bytes & 0xC0) != 0x80)
                return false;
            bytes += 1;
        }
    }

    return true;
}

Now, this does not take into account illegal UTF-8 sequences, such as overlong encodings, UTF-16 surrogates, and codepoints above U+10FFFF. If you want to make sure the UTF-8 is both valid and correct, you would need something more like this:

bool is_valid_utf8(const char * string)
{
    if (!string)
        return true;

    const unsigned char * bytes = (const unsigned char *)string;
    unsigned int cp;
    int num;

    while (*bytes != 0x00)
    {
        if ((*bytes & 0x80) == 0x00)
        {
            // U+0000 to U+007F 
            cp = (*bytes & 0x7F);
            num = 1;
        }
        else if ((*bytes & 0xE0) == 0xC0)
        {
            // U+0080 to U+07FF 
            cp = (*bytes & 0x1F);
            num = 2;
        }
        else if ((*bytes & 0xF0) == 0xE0)
        {
            // U+0800 to U+FFFF 
            cp = (*bytes & 0x0F);
            num = 3;
        }
        else if ((*bytes & 0xF8) == 0xF0)
        {
            // U+10000 to U+10FFFF 
            cp = (*bytes & 0x07);
            num = 4;
        }
        else
            return false;

        bytes += 1;
        for (int i = 1; i < num; ++i)
        {
            if ((*bytes & 0xC0) != 0x80)
                return false;
            cp = (cp << 6) | (*bytes & 0x3F);
            bytes += 1;
        }

        if ((cp > 0x10FFFF) ||
            ((cp >= 0xD800) && (cp <= 0xDFFF)) ||
            ((cp <= 0x007F) && (num != 1)) ||
            ((cp >= 0x0080) && (cp <= 0x07FF) && (num != 2)) ||
            ((cp >= 0x0800) && (cp <= 0xFFFF) && (num != 3)) ||
            ((cp >= 0x10000) && (cp <= 0x1FFFFF) && (num != 4)))
            return false;
    }

    return true;
}
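
As a quick illustration of the difference, a minimal sketch: the two-byte sequence "\xC0\xAF" is an overlong encoding of U+002F ('/'), so is_utf8 accepts it as structurally well-formed while is_valid_utf8 rejects it.

#include <cstdio>

int main()
{
    // "\xC0\xAF" is an overlong encoding of '/': a valid-looking lead byte
    // followed by a valid continuation byte, but illegal per the UTF-8 spec.
    const char * overlong = "\xC0\xAF";

    std::printf("%d\n", is_utf8(overlong));        // prints 1 (accepted)
    std::printf("%d\n", is_valid_utf8(overlong));  // prints 0 (rejected)
    return 0;
}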

7 Comments

How does (*bytes & 0xE0) == 0xC0 give the range 0x80 to 0x7FF? It should give the range 0xC0 to 0xDF.
@ahmedallam no, what I wrote is correct. Look at the bit pattern table described on Wikipedia for UTF-8. Unicode codepoints U+0080 to U+07FF (not bytes 0xC0 to 0xDF) are encoded in 2 bytes using the bit pattern 110xxxxx 10xxxxxx. 0xE0 is bits 11100000 and 0xC0 is bits 11000000. So, if ((*bytes & 0xE0) == 0xC0) is checking if the high 3 bits of the 1st byte are 110 before (*bytes & 0x1F) grabs the low 5 bits. Then later, ((*bytes & 0xC0) != 0x80) checks if the high 2 bits of the 2nd byte are 10 before (*bytes & 0x3F) grabs the low 6 bits.
@ahmedallam seems you need to brush up on how bits, bit masks, and bitwise operators work.
@RemyLebeau Is this exception/thread safe? (noob question)
@NorbertBoros as long as the string parameter is pointing at a valid C-style null-terminated string, and that memory is not modified or freed by another thread while the function is running, then yes, the function is safe. Otherwise, its behavior is undefined.

You probably don't understand UTF-8 and the alternatives. There are only 256 possible values for a byte. That's not a lot, given the number of characters. As a result, many byte sequences are both valid UTF-8 strings and valid strings in other encodings.

In fact, every ASCII string is intentionally a valid UTF-8 string with essentially the same meaning. Your code would return true for is_utf8("Hello").

Many other non-UTF-8, non-ASCII strings also share byte sequences with valid UTF-8 strings. And there is simply no way to convert a non-UTF-8 string to UTF-8 without knowing exactly which non-UTF-8 encoding it uses. Even Latin-1 and Latin-2 are already quite different. CP_ACP is even worse than Latin-1: it isn't even the same on every system.

Your text must go into the database as UTF-8. Thus, if it isn't yet UTF-8, it must be converted, and you must know the exact source encoding. There is no magical escape.

On Linux, iconv is the usual method to convert between two encodings.
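
A minimal iconv sketch, assuming the source data turns out to be Windows-1252 (substitute whatever encoding your input actually uses):

#include <iconv.h>
#include <string>
#include <stdexcept>

// Convert 'input' from the given source encoding (e.g. "WINDOWS-1252") to UTF-8.
std::string to_utf8(const std::string & input, const char * from_encoding)
{
    iconv_t cd = iconv_open("UTF-8", from_encoding);
    if (cd == (iconv_t)-1)
        throw std::runtime_error("unsupported encoding");

    // A single-byte source character expands to at most a few UTF-8 bytes,
    // so 4x the input size is a safe upper bound.
    std::string output(input.size() * 4, '\0');
    char * in_ptr = const_cast<char *>(input.data());
    size_t in_left = input.size();
    char * out_ptr = &output[0];
    size_t out_left = output.size();

    size_t result = iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left);
    iconv_close(cd);
    if (result == (size_t)-1)
        throw std::runtime_error("conversion failed");

    output.resize(output.size() - out_left);
    return output;
}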

1 Comment

The question does not do the technical breakdown, but I think it is understandable: a "UTF-8 stream", by its grammar, is a subset of a "byte stream" and is independent of which extension of the 7-bit ASCII character set is in use. Only the difference class is detectable.

Simple validation of a null-terminated UTF-8 string (C++20):

#include <cassert>
#include <bit>

constexpr bool validate_utf8(const char* string) noexcept
{
    assert(string != nullptr);

    while (*string)
    {
        switch (std::countl_one(static_cast<unsigned char>(*string)))
        {
            [[unlikely]] case 4: ++string; if (std::countl_one(static_cast<unsigned char>(*string)) != 1) return false; [[fallthrough]];
            [[unlikely]] case 3: ++string; if (std::countl_one(static_cast<unsigned char>(*string)) != 1) return false; [[fallthrough]];
            [[unlikely]] case 2: ++string; if (std::countl_one(static_cast<unsigned char>(*string)) != 1) return false; [[fallthrough]];
              [[likely]] case 0: ++string; break;
            [[unlikely]] default: return false;
        }
    }

    return true;
}
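
Since the function is constexpr, it can be exercised at compile time; a small sketch:

static_assert(validate_utf8("plain ASCII"));   // valid
static_assert(validate_utf8("\xC3\xA9"));      // "é" as a 2-byte sequence
static_assert(!validate_utf8("\xC3"));         // truncated sequence
static_assert(!validate_utf8("\xA9 stray"));   // lone continuation byte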

Another function (C++14) that is about 3 times faster than the previous code, at least on GCC, though the code above is more elegant.

#include <cassert>

constexpr bool validate_utf8(const char* string) noexcept
{
    assert(string != nullptr);

    while (*string)
    {
        if ((*string & 0b10000000) != 0)
        {
            if ((*string & 0b01000000) == 0) return false;
            if ((*string & 0b00100000) != 0)
            {
                if ((*string & 0b00010000) != 0)
                {
                    if ((*string & 0b00001000) != 0)
                        return false;

                    if ((*++string & 0b11000000) != 0b10000000)
                        return false;
                }

                if ((*++string & 0b11000000) != 0b10000000)
                    return false;
            }

            if ((*++string & 0b11000000) != 0b10000000)
                return false;
        }

        ++string;
    }

    return true;
}

1 Comment

std::countl_one has only existed since C++20; this is the first I've heard of it.

This is not the kind of function you want to write yourself. I would suggest you look at using simdjson, which is what I am using for this purpose.

Don't be deterred by the library being called simdjson, that is, by the word "JSON" in its name. It also contains a function for validating UTF-8 strings.

#include "simdjson.h"
#include <cstring>

const char * some_string = "[ 1, 2, 3, 4] ";
size_t length = std::strlen(some_string);
bool is_ok = simdjson::validate_utf8(some_string, length);

You can find a useful blog post about this function here.

If you don't want to import the whole of simdjson into your project, my understanding is that it is possible to use simdutf instead.

I'm not sure if simdjson depends on simdutf, or uses parts of it. It may do.
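
For reference, a minimal sketch using simdutf's validate_utf8 entry point (assuming you build and link simdutf separately):

#include "simdutf.h"
#include <cstring>

int main()
{
    const char * some_string = "valid UTF-8: \xC3\xA9";
    bool is_ok = simdutf::validate_utf8(some_string, std::strlen(some_string));
    return is_ok ? 0 : 1;
}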
