lua - string.byte for non ascii characters

Question

I want to convert characters to numerical codes, so I tried string.byte("å"). However, string.byte() returns 195 for this kind of character.

Is there any way to get the numerical code of non-ASCII characters such as these?

à,á,â,ã,ä,å

I'm using pure Lua.

Its UTF-8 encoding is 195, 165 (two bytes). You can get it with print(string.byte("å", 1, -1)).
– Egor Skriptunoff, Commented Jun 12, 2014 at 17:43
A Lua string is a counted sequence of bytes. What you put in those bytes, in this case, is between you and your code editor.
– Tom Blodget, Commented Jun 12, 2014 at 22:14
Your question is a bit unclear. You have saved your script with a UTF-8 encoding. @YuHao shows how to retrieve the variable number of bytes for each character in a string. But do you actually want the code points for the characters? For å, the Unicode code point is 229.
– Tom Blodget, Commented Jun 13, 2014 at 1:04

Yu Hao · Accepted Answer · 2014-06-13 00:48:31Z

5

Lua thinks a string is a sequence of bytes, but a Unicode character may contain multiple bytes.
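
A minimal sketch of that point. The character is written with decimal byte escapes so the result does not depend on how the source file is saved, and the variable name `s` is only for illustration:

```lua
-- "å" encoded as UTF-8 is the two bytes 195 and 165.
local s = "\195\165"

print(#s)              -- length in bytes, not characters: 2
print(s:byte())        -- first byte only: 195
print(s:byte(1, -1))   -- every byte: 195, then 165
```

This is why string.byte("å") on its own reports 195: with no range given, it returns only the first byte.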

Assuming the string is valid UTF-8, you can use the pattern "[\0-\x7F\xC2-\xF4][\x80-\xBF]*" to match a single UTF-8 byte sequence (in Lua 5.1, use "[%z\1-\127\194-\244][\128-\191]*"), and then get its numerical codes:

local str = "à,á,â,ã,ä,å"

for c in str:gmatch("[\0-\x7F\xC2-\xF4][\x80-\xBF]*") do
    print(c:byte(1, -1))
end

Output:

195    160
44
195    161
44
195    162
44
195    163
44
195    164
44
195    165

Note that 44 is the encoding for the comma.
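
If code points rather than raw bytes are wanted, the `utf8` standard library is another option. The sketch below assumes Lua 5.3 or later, since that library is not available in 5.1 or 5.2. The sample string is written with byte escapes and is only an illustration:

```lua
-- Requires Lua 5.3 or later: the utf8 library does not exist in 5.1/5.2.
local str = "\195\160,\195\165"   -- "à,å" written as UTF-8 byte escapes

-- utf8.codes iterates over (byte position, code point) pairs.
for pos, cp in utf8.codes(str) do
    print(pos, cp)
end

-- utf8.codepoint returns the code points of the characters in a byte range.
print(utf8.codepoint(str, 1, -1))   -- 224, 44, 229
```

Here 224 and 229 are the code points of à and å, and 44 is again the comma.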

3 Comments

Deduplicator Over a year ago

Lua defines a string as a counted sequence of bytes. The spec defines it so. If you want something else, you took the wrong data-type.

Yu Hao Over a year ago

@Deduplicator You mean: because native Lua string doesn't support Unicode (yet), then don't try to solve Unicode problems with Lua? Why not if the solution is so simple?

Deduplicator Over a year ago

That's not what I meant. I just said that the Lua documentation cannot be mistaken in what a string is, because it is the authority responsible for the definition (at least with regards to Lua). The fact that a byte-string is not restricted to valid UTF-8 (any UTF-8 string is a valid Lua string though), nor is just an interface to unicode codepoints or graphemes or grapheme-clusters does not change anything. Just change the first sentence, and it's ok. BTW: Changing the Lua string to be restricted to Unicode and enforcing Unicode semantics would make it useless in many contexts.

darkfrei · Accepted Answer · 2024-04-25 16:44:40Z

This works like string.byte(), but returns the Unicode code point of a single UTF-8 encoded character:

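function utf8Byte(char)
    -- Read up to four bytes; bytes past the end of the string are nil.
    local b1, b2, b3, b4 = char:byte(1, 4)
    if b1 < 20 then
        -- low control bytes are rejected
        return nil
    elseif b1 < 128 then
        -- single-byte (ASCII) character
        return b1
    elseif b1 < 194 then
        -- continuation byte or overlong lead byte: not a valid start
        return nil
    elseif b1 < 224 then
        -- two-byte sequence: 5 bits + 6 bits
        return (b1 - 192) * 64 + (b2 - 128)
    elseif b1 < 240 then
        -- three-byte sequence: 4 bits + 6 bits + 6 bits
        return (b1 - 224) * 4096 + (b2 - 128) * 64 + (b3 - 128)
    elseif b1 < 245 then
        -- four-byte sequence: 3 bits + 6 bits + 6 bits + 6 bits
        return (b1 - 240) * 262144 + (b2 - 128) * 4096 + (b3 - 128) * 64 + (b4 - 128)
    else
        return nil
    end
end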
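
As a worked check of the two-byte arithmetic: å is the bytes 195, 165, so (195 - 192) * 64 + (165 - 128) = 3 * 64 + 37 = 229, which matches the code point quoted in the comments on the question. The sketch below expresses the same arithmetic as a loop and also checks that each continuation byte lies in 128..191. `decodeUtf8` is a hypothetical helper name, not part of the answer above, and it does not try to reject overlong encodings or surrogates:

```lua
-- Hypothetical helper: decode the first UTF-8 sequence in s,
-- returning nil if it is empty, truncated, or malformed.
local function decodeUtf8(s)
    local b1 = s:byte(1)
    if not b1 then return nil end          -- empty string
    if b1 < 128 then return b1 end         -- ASCII, single byte

    -- Number of continuation bytes and the offset to strip from the lead byte.
    local extra, offset
    if b1 >= 194 and b1 < 224 then extra, offset = 1, 192
    elseif b1 >= 224 and b1 < 240 then extra, offset = 2, 224
    elseif b1 >= 240 and b1 < 245 then extra, offset = 3, 240
    else return nil end                    -- stray continuation or invalid lead

    local code = b1 - offset
    for i = 2, extra + 1 do
        local b = s:byte(i)
        -- Continuation bytes must lie in 128..191 (binary 10xxxxxx).
        if not b or b < 128 or b > 191 then return nil end
        code = code * 64 + (b - 128)
    end
    return code
end

print(decodeUtf8("\195\165"))        -- å: 229
print(decodeUtf8("\226\130\172"))    -- €: 8364
print(decodeUtf8("\195"))            -- truncated sequence: nil
```

Multiplying by 64 at each step shifts the accumulated value left by six bits, which is the payload carried by every continuation byte.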

Example:

local unicodeChars = {"A", "~", "¡", "ÿ", "Ā", "Ȁ", "Ф", "ૐ", "⼈", "ﬀ", "𐌸"}
for _, uChar in ipairs (unicodeChars) do
    local index = utf8Byte (uChar)
    print (index, uChar)
end

Result:
