Unicode not displaying correctly using JavaScript

Question

convert.onclick =
  function() {
    for (var i = 0; i < before.value.length; i++) {
      after.value += "'" + before.value.charAt(i) + "', ";
    }
  }

<textarea id="before" type="text" name="input" style="width:100%;">*𝟡(𝟘)-_=+𝕢ℚ𝕨𝕎𝕖𝔼𝕣ℝ𝕥𝕋𝕪𝕐𝕦𝕌</textarea><br />
<textarea id="after" cols="50" rows="10" name="output" style="width:100%;"></textarea>
<button id="convert" name="convert" type="button">convert</button>

Here is some simple code. When I run it, I get the following results.

Problem

Some characters are converted successfully, but most of the Unicode characters are not displayed correctly. How do I fix this problem?

Are you sure all your Unicode characters have a length of only 1?
– connexo, Commented Mar 5, 2020 at 17:50
I don't understand what you are trying to do. Converted to what?
– evolutionxbox, Commented Mar 6, 2020 at 10:26

Klaycon · Accepted Answer · 2020-03-05 18:03:15Z

What you're running into are called surrogate pairs. JavaScript strings are sequences of UTF-16 code units, and some Unicode characters (those above U+FFFF) are made up of two code units instead of one. If you separate the two halves, they no longer display correctly.
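To make the term concrete, here is a small sketch that lists the 16-bit code units of a string in hexadecimal. The helper name `toCodeUnits` is made up for illustration. For 𝟡 (U+1D7E1) it shows the two halves of the pair; neither half is a valid character on its own, which is why `charAt(i)` produces broken output.

```javascript
// Hypothetical helper: list the UTF-16 code units of a string as hex strings.
function toCodeUnits(str) {
  var units = [];
  for (var i = 0; i < str.length; i++) {
    units.push(str.charCodeAt(i).toString(16));
  }
  return units;
}

console.log(toCodeUnits("9"));  // [ '39' ]: one code unit
console.log(toCodeUnits("𝟡")); // [ 'd835', 'dfe1' ]: high and low surrogate
console.log("𝟡".codePointAt(0).toString(16)); // 1d7e1: the code point behind the pair
```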

If you can use ES6, iterating over a string with the spread operator or for..of takes surrogate pairs into account and gives you correct results more easily. Other answers show how to do this.

If you can't use ES6, MDN has an example of how to handle these with charAt. I'll use that code below.

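function getWholeChar(str, i) {
  var code = str.charCodeAt(i);

  // Index out of range: charCodeAt returns NaN (isNaN is used because Number.isNaN is ES6)
  if (isNaN(code)) return '';
  // Not a surrogate: an ordinary single-unit character
  if (code < 0xD800 || code > 0xDFFF) return str.charAt(i);

  // High surrogate: return it together with the low surrogate that follows
  if (0xD800 <= code && code <= 0xDBFF) {
    if (str.length <= (i + 1)) throw 'High surrogate without following low surrogate';
    var next = str.charCodeAt(i + 1);
    if (0xDC00 > next || next > 0xDFFF) throw 'High surrogate without following low surrogate';
    return str.charAt(i) + str.charAt(i + 1);
  }

  // Low surrogate: it must follow a high surrogate
  if (i === 0) throw 'Low surrogate without preceding high surrogate';
  var prev = str.charCodeAt(i - 1);

  if (0xD800 > prev || prev > 0xDBFF) throw 'Low surrogate without preceding high surrogate';
  // Already returned as part of the previous pair, so signal the caller to skip it
  return false;
}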

convert.onclick =
  function() {
    for (var i = 0, chr; i < before.value.length; i++) {
      if(!(chr = getWholeChar(before.value, i))) continue;
      after.value += "'" + chr + "', ";
    }
  }

<textarea id="before" type="text" name="input" style="width:100%;">*𝟡(𝟘)-_=+𝕢ℚ𝕨𝕎𝕖𝔼𝕣ℝ𝕥𝕋𝕪𝕐𝕦𝕌</textarea><br />
<textarea id="after" cols="50" rows="10" name="output" style="width:100%;"></textarea>
<button id="convert" name="convert" type="button">convert</button>
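As a shorter alternative that also avoids ES6, a regular expression can split the string at surrogate-pair boundaries. This is a sketch of a different technique, not the MDN code above; the names `CODE_POINT_RE`, `splitCodePoints` and `quoteEach` are made up for illustration. Unlike `getWholeChar`, it does not throw on an unpaired surrogate and simply passes it through as a single unit.

```javascript
// Match a high+low surrogate pair first, otherwise any single code unit.
var CODE_POINT_RE = /[\uD800-\uDBFF][\uDC00-\uDFFF]|[\s\S]/g;

// Hypothetical helper: split a string into whole characters (code points).
function splitCodePoints(str) {
  return str.match(CODE_POINT_RE) || []; // match() returns null for ""
}

// Build the same quoted, comma-separated output as the question's loop.
function quoteEach(str) {
  var parts = splitCodePoints(str);
  var out = "";
  for (var i = 0; i < parts.length; i++) {
    out += "'" + parts[i] + "', ";
  }
  return out;
}

console.log(quoteEach("*𝟡(𝟘)"));
```

In the snippet above, the click handler could then assign `after.value = quoteEach(before.value);`.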

ponury-kostek · Accepted Answer · 2020-03-05 17:57:36Z

1

You can use the spread operator (...) to create an array of Unicode characters (code points):

convert.onclick = function () {
	after.value = [...before.value].map(s => `'${s}'`).join(",");
};

<textarea id="before" type="text" name="input" style="width:100%;">*𝟡(𝟘)-_=+𝕢ℚ𝕨𝕎𝕖𝔼𝕣ℝ𝕥𝕋𝕪𝕐𝕦𝕌</textarea><br />
<textarea id="after" cols="50" rows="10" name="output" style="width:100%;"></textarea>
<button id="convert" name="convert" type="button">convert</button>
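The difference from splitting by index is easy to see side by side. The following is only an illustrative comparison, and `compareSplits` is a made-up name: `split("")` cuts between UTF-16 code units, while spread goes through the string iterator, which walks code points.

```javascript
// Illustrative comparison; the function name is made up for this sketch.
function compareSplits(str) {
  return {
    byCodeUnit: str.split(""), // cuts between UTF-16 code units
    byCodePoint: [...str],     // string iterator walks code points
  };
}

const result = compareSplits("𝕢ℚ");
console.log(result.byCodeUnit.length);  // 3: 𝕢 is torn into two surrogate halves
console.log(result.byCodePoint.length); // 2
console.log(result.byCodePoint.map(s => `'${s}'`).join(",")); // '𝕢','ℚ'
```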

edited Mar 5, 2020 at 17:57

answered Mar 5, 2020 at 17:52

connexo · Accepted Answer · 2020-03-05 18:08:38Z

0

That is because JavaScript strings are UTF-16, and characters above U+FFFF (outside the Basic Multilingual Plane) have a length greater than 1.

console.log("9".length);  // 1
console.log("𝟡".length); // 2

console.log("𝟡".charAt(0)); // a lone high surrogate, which does not display correctly
console.log(String.fromCodePoint("𝟡".codePointAt(0))); // 𝟡

To fix it, use String.fromCodePoint and codePointAt instead of charAt:

convert.onclick =
  function() {
    for (const char of before.value) {
      after.value += `'${String.fromCodePoint(char.codePointAt(0))}'`;
    }
  }

<textarea id="before" type="text" name="input" style="width:100%;">*𝟡(𝟘)-_=+𝕢ℚ𝕨𝕎𝕖𝔼𝕣ℝ𝕥𝕋𝕪𝕐𝕦𝕌</textarea><br />
<textarea id="after" cols="50" rows="10" name="output" style="width:100%;"></textarea>
<button id="convert" name="convert" type="button">convert</button>

You can also do an index-based traversal, but that requires increasing the index variable inside the loop, depending on the length of the character currently being traversed:

convert.onclick =
  function() {
    for (let i = 0; i < before.value.length; ) {
      after.value += `'${String.fromCodePoint(before.value.codePointAt(i))}'`;
      i+= String.fromCodePoint(before.value.codePointAt(i)).length;
    }
  }

<textarea id="before" type="text" name="input" style="width:100%;">*𝟡(𝟘)-_=+𝕢ℚ𝕨𝕎𝕖𝔼𝕣ℝ𝕥𝕋𝕪𝕐𝕦𝕌</textarea><br />
<textarea id="after" cols="50" rows="10" name="output" style="width:100%;"></textarea>
<button id="convert" name="convert" type="button">convert</button>
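The step size can also be derived from the code point itself, which avoids building the string twice per iteration. This is a minimal sketch of the same index-based idea as a pure function; `collectCodePoints` is a hypothetical name, not part of any API.

```javascript
// Sketch of an index-based loop that advances by one or two code units.
function collectCodePoints(str) {
  const result = [];
  for (let i = 0; i < str.length; ) {
    const cp = str.codePointAt(i);
    result.push(String.fromCodePoint(cp));
    // Code points above U+FFFF occupy two UTF-16 code units, all others one.
    i += cp > 0xFFFF ? 2 : 1;
  }
  return result;
}

console.log(collectCodePoints("𝟡(𝟘)").map(c => `'${c}'`).join(", "));
// '𝟡', '(', '𝟘', ')'
```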

edited Mar 5, 2020 at 18:08

answered Mar 5, 2020 at 17:51

4 Comments

ponury-kostek Over a year ago

Why use String.fromCodePoint(char.codePointAt(0)) if you already have a single Unicode character?

connexo Over a year ago

@Ponury Just for demonstration purposes, to make OP aware of these methods.

Klaycon Over a year ago

Using for..of to iterate over a string also makes codePointAt redundant, since for..of already accounts for surrogate pairs.

connexo Over a year ago

I do know that, but doing an index-based loop here would make things a little tricky, since you would have to increase the index variable inside the loop depending on the character's length property.
