String casting and Unicode in golang

Question

A string in Go is an immutable sequence of bytes (8-bit byte values). This is different from languages like Python, C#, Java or Swift, where strings are Unicode.

I am playing around with the following code:

s := "日本語"
b := []byte{0xe6, 0x97, 0xa5, 0xe6, 0x9c, 0xac, 0xe8, 0xaa, 0x9e}
fmt.Println(string(b) == s) // true

for i, runeChar := range b {
    fmt.Printf("byte position %d: %#U\n", i, runeChar)
}

//byte position 0: U+00E6 'æ'
//byte position 1: U+0097
//byte position 2: U+00A5 '¥'
//byte position 3: U+00E6 'æ'
//byte position 4: U+009C
//byte position 5: U+00AC '¬'
//byte position 6: U+00E8 'è'
//byte position 7: U+00AA 'ª'
//byte position 8: U+009E

for i, runeChar := range string(b) {
    fmt.Printf("byte position %d: %#U\n", i, runeChar)
}

//byte position 0: U+65E5 '日'
//byte position 3: U+672C '本'
//byte position 6: U+8A9E '語'

Questions:

Where does Go get the Unicode code points from when converting a byte slice to a string? How is a rune formed? Does the Go compiler take the Unicode from the source file's text encoding at compile time?
What are the advantages and disadvantages of implementing a string as a byte array instead of an array of UTF-16 chars, as in Java?

There is no "casting". Go assumes strings are a UTF-8 encoded series of bytes.
– Mr_Pink, Commented Jun 15, 2018 at 15:20
Question 1 is very cryptic; could you clarify? Question 2: Java got it wrong. UTF-16 is a stupid encoding: it's wasteful for ASCII while still not providing a big enough range for all code points in a single code unit. UTF-8 is the only sensible encoding.
– Volker, Commented Jun 15, 2018 at 18:09
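
To put rough numbers on the size trade-off raised in the comment above, here is a minimal sketch that compares how many bytes the same text needs in UTF-8 and in UTF-16. The helper names utf8Bytes and utf16Bytes are made up for this illustration; utf16.Encode is from the standard unicode/utf16 package. Note that UTF-16 does reach every code point, but it needs two 16-bit units (a surrogate pair) for anything above U+FFFF.

```go
package main

import (
	"fmt"
	"unicode/utf16"
)

// utf8Bytes (hypothetical helper) reports how many bytes s occupies
// as a Go string, which is its UTF-8 length.
func utf8Bytes(s string) int {
	return len(s)
}

// utf16Bytes (hypothetical helper) reports how many bytes s would need
// in UTF-16: two bytes per 16-bit code unit, with surrogate pairs
// counting as two units.
func utf16Bytes(s string) int {
	return 2 * len(utf16.Encode([]rune(s)))
}

func main() {
	for _, s := range []string{"hello", "日本語", "😀"} {
		fmt.Printf("%q: UTF-8 %d bytes, UTF-16 %d bytes\n", s, utf8Bytes(s), utf16Bytes(s))
	}
}
```

For these three samples, ASCII text is half the size in UTF-8, the CJK text is smaller in UTF-16 (6 bytes against 9), and the emoji takes 4 bytes either way. Which encoding is "better" therefore depends on the text being stored.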

peterSO · Accepted Answer · 2018-06-15 16:18:56Z

10

You are quoting from a weak, unreliable source: Go Essentials: Strings. Amongst other things, there is no mention of Unicode codepoints or UTF-8 encoding.

For example,

package main

import "fmt"

func main() {
    s := "日本語"
    fmt.Printf("Glyph:             %q\n", s)
    fmt.Printf("UTF-8:             [% x]\n", []byte(s))
    fmt.Printf("Unicode codepoint: %U\n", []rune(s))
}

Playground: https://play.golang.org/p/iaYd80Ocitg

Output:

Glyph:             "日本語"
UTF-8:             [e6 97 a5 e6 9c ac e8 aa 9e]
Unicode codepoint: [U+65E5 U+672C U+8A9E]
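
As for how a rune "forms": nothing is looked up or converted when you write string(b); the bytes are kept as they are. The decoding happens when you range over the string (or convert it to []rune), which reads one UTF-8 sequence at a time. The following is a rough sketch of that process using utf8.DecodeRune from the standard unicode/utf8 package. The helper decodeAll is a name invented for this example, not something from the standard library.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// decodeAll (hypothetical helper) walks b roughly the way a range loop
// over string(b) does: it decodes one UTF-8 sequence at a time and
// records the byte offset at which each rune starts.
func decodeAll(b []byte) (offsets []int, runes []rune) {
	for i := 0; i < len(b); {
		// DecodeRune returns the code point and the number of bytes
		// it occupied. Invalid input yields utf8.RuneError with size 1.
		r, size := utf8.DecodeRune(b[i:])
		offsets = append(offsets, i)
		runes = append(runes, r)
		i += size
	}
	return offsets, runes
}

func main() {
	b := []byte{0xe6, 0x97, 0xa5, 0xe6, 0x9c, 0xac, 0xe8, 0xaa, 0x9e}
	offsets, runes := decodeAll(b)
	for k, r := range runes {
		fmt.Printf("byte position %d: %#U\n", offsets[k], r)
	}
}
```

This prints the same three lines as the range loop over string(b) in the question (positions 0, 3 and 6), because each of these characters is encoded as three bytes in UTF-8.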

References:

The Go Blog: Strings, bytes, runes and characters in Go

The Go Programming Language Specification

Unicode FAQ: UTF-8, UTF-16, UTF-32 & BOM

The Unicode Consortium

edited Jun 15, 2018 at 16:18

answered Jun 15, 2018 at 15:40

peterSO
