I want to convert characters to numerical codes, so I tried string.byte("å"). However, string.byte() returns 195 for this kind of character.

Is there any way to get the numerical code of non-ASCII characters such as these?

à,á,â,ã,ä,å

I'm using pure Lua.

  • Its UTF-8 code is 195,165 (two bytes); it can be obtained with print(string.byte("å",1,-1)). Commented Jun 12, 2014 at 17:43
  • possible duplicate of What is Unicode, UTF-8, UTF-16? Commented Jun 12, 2014 at 17:49
  • A Lua string is a counted sequence of bytes. What you put in those bytes, in this case, is between you and your code editor. Commented Jun 12, 2014 at 22:14
  • Your question is a bit unclear. You have saved your script with a UTF-8 encoding. @YuHao shows how to retrieve the variable number of bytes for each character in a string. But do you actually want the code points for the characters? For å in Unicode, it would be 229. Commented Jun 13, 2014 at 1:04
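
To make the distinction between bytes and code points concrete, here is a minimal sketch of how the two bytes 195 and 165 combine into the code point 229. The string is written with decimal escapes so the result does not depend on how the script file is saved; the variable names are illustrative only.

```lua
-- "å" written as its two UTF-8 bytes, independent of the file's encoding.
local s = "\195\165"

-- string.byte with the range 1, -1 returns every byte of the string.
local b1, b2 = string.byte(s, 1, -1)
print(b1, b2)      --> 195     165

-- A two-byte UTF-8 sequence has the form 110xxxxx 10yyyyyy.
-- Subtracting the marker bits (192 and 128) leaves the payload bits.
local codepoint = (b1 - 192) * 64 + (b2 - 128)
print(codepoint)   --> 229
```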

2 Answers

Lua thinks a string is a sequence of bytes, but a Unicode character may contain multiple bytes.

Assuming the string is valid UTF-8, you can use the pattern "[\0-\x7F\xC2-\xF4][\x80-\xBF]*" to match a single UTF-8 byte sequence (in Lua 5.1, use "[%z\1-\127\194-\244][\128-\191]*" instead), and then get its numerical codes:

local str = "à,á,â,ã,ä,å"

for c in str:gmatch("[\0-\x7F\xC2-\xF4][\x80-\xBF]*") do
    print(c:byte(1, -1))
end

Output:

195 160
44
195 161
44
195 162
44
195 163
44
195 164
44
195 165

Note that 44 is the encoding for the comma.
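
If the goal is the Unicode code point rather than the raw bytes, each matched sequence can be decoded arithmetically. The following is a rough sketch, not part of the answer itself: `decode` and `codepoints` are hypothetical helper names, the pattern is the Lua 5.1 form quoted above, and the input is assumed to be valid UTF-8 (no validation is attempted).

```lua
-- Decode one UTF-8 sequence (1 to 4 bytes) into its code point.
-- Assumes the sequence is valid; no error checking is done.
local function decode(seq)
    local b1, b2, b3, b4 = seq:byte(1, 4)
    if b1 < 128 then
        return b1
    elseif b1 < 224 then
        return (b1 - 192) * 64 + (b2 - 128)
    elseif b1 < 240 then
        return (b1 - 224) * 4096 + (b2 - 128) * 64 + (b3 - 128)
    else
        return (b1 - 240) * 262144 + (b2 - 128) * 4096
             + (b3 - 128) * 64 + (b4 - 128)
    end
end

-- Collect the code points of every character in a UTF-8 string.
local function codepoints(str)
    local result = {}
    for seq in str:gmatch("[%z\1-\127\194-\244][\128-\191]*") do
        result[#result + 1] = decode(seq)
    end
    return result
end

-- "à,å" spelled out as bytes so the result does not depend on file encoding.
print(table.concat(codepoints("\195\160,\195\165"), " "))  --> 224 44 229
```

Splitting the work into a pattern match followed by a per-sequence decode keeps the byte-level pattern from the answer unchanged and adds only the arithmetic step.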

3 Comments

Lua defines a string as a counted sequence of bytes. The spec defines it so. If you want something else, you took the wrong data-type.
@Deduplicator You mean: because native Lua string doesn't support Unicode (yet), then don't try to solve Unicode problems with Lua? Why not if the solution is so simple?
That's not what I meant. I just said that the Lua documentation cannot be mistaken in what a string is, because it is the authority responsible for the definition (at least with regards to Lua). The fact that a byte-string is not restricted to valid UTF-8 (any UTF-8 string is a valid Lua string though), nor is just an interface to unicode codepoints or graphemes or grapheme-clusters does not change anything. Just change the first sentence, and it's ok. BTW: Changing the Lua string to be restricted to Unicode and enforcing Unicode semantics would make it useless in many contexts.
This works like string.byte(), but returns the Unicode code point of a single UTF-8 character:

function utf8Byte(char)
    local b1, b2, b3, b4 = char:byte(1, 4)
    if not b1 or b1 < 20 then
        -- empty string, or a control byte below 20 (rejected by this function)
        return nil
    elseif b1 < 128 then
        -- single-byte (ASCII) character
        return b1
    elseif b1 < 194 then
        -- continuation byte or overlong lead byte: not a valid start
        return nil
    elseif b1 < 224 then
        -- two-byte sequence: 110xxxxx 10xxxxxx
        return (b1 - 192) * 64 + (b2 - 128)
    elseif b1 < 240 then
        -- three-byte sequence: 1110xxxx 10xxxxxx 10xxxxxx
        return (b1 - 224) * 4096 + (b2 - 128) * 64 + (b3 - 128)
    elseif b1 < 245 then
        -- four-byte sequence: 11110xxx plus three continuation bytes
        return (b1 - 240) * 262144 + (b2 - 128) * 4096 + (b3 - 128) * 64 + (b4 - 128)
    else
        return nil
    end
end

Example:

local unicodeChars = {"A", "~", "¡", "ÿ", "Ā", "Ȁ", "Ф", "ૐ", "⼈", "ff", "𐌸"}
for _, uChar in ipairs (unicodeChars) do
    local index = utf8Byte (uChar)
    print (index, uChar)
end

Result:

65  A
126 ~
161 ¡
255 ÿ
256 Ā
512 Ȁ
1060    Ф
2768    ૐ
12040   ⼈
64256   ff
66360   𐌸
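
On Lua 5.3 or later, the standard utf8 library covers the same ground, so hand-written decoding may be unnecessary there. This is a sketch under that version assumption; the guard lets the file still load on older interpreters, where the block simply does nothing. The sample strings are written as byte escapes so the result does not depend on file encoding.

```lua
-- The utf8 library exists only in Lua 5.3 and later, so guard its use.
if utf8 then
    -- utf8.codepoint(s) returns the code point of the first character.
    print(utf8.codepoint("\195\165"))  --> 229

    -- utf8.codes iterates over (byte position, code point) pairs.
    for pos, cp in utf8.codes("\195\160,\195\165") do
        print(pos, cp)
    end
end
```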
