0

convert.onclick =
  function() {
    for (var i = 0; i < before.value.length; i++) {
      after.value += "'" + before.value.charAt(i) + "', ";
    }
  }
<textarea id="before" type="text" name="input" style="width:100%;">*𝟡(𝟘)-_=+𝕢ℚ𝕨𝕎𝕖𝔼𝕣ℝ𝕥𝕋𝕪𝕐𝕦𝕌</textarea><br />
<textarea id="after" cols="50" rows="10" name="output" style="width:100%;"></textarea>
<button id="convert" name="convert" type="button">convert</button>

Here is a simple piece of code. When I run it, I get the following results.

Problem

Some characters are converted successfully, but most of the Unicode characters are not displayed correctly. How do I fix this?

2
  • Are you sure all of your Unicode characters have a length of only 1? Commented Mar 5, 2020 at 17:50
  • I don't understand what you are trying to do. Converted to what? Commented Mar 6, 2020 at 10:26

3 Answers

2

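What you're running into are called surrogate pairs. JavaScript strings are sequences of UTF-16 code units, and characters with code points above U+FFFF are encoded as two 16-bit code units instead of one. charAt returns a single code unit, so it splits such a pair in half, and the halves no longer display correctly.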
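
To make the encoding concrete, here is a minimal sketch of how a code point above U+FFFF maps to its two surrogate code units. The helper name toSurrogatePair is hypothetical (it is not part of the MDN example or the question); the arithmetic follows the UTF-16 definition.

```javascript
// Hypothetical helper: split a supplementary code point (> 0xFFFF) into
// its UTF-16 high and low surrogate code units.
function toSurrogatePair(codePoint) {
  if (codePoint <= 0xFFFF) {
    throw new RangeError('Code point fits in a single code unit');
  }
  var offset = codePoint - 0x10000;      // 20 bits remain
  var high = 0xD800 + (offset >> 10);    // top 10 bits
  var low = 0xDC00 + (offset & 0x3FF);   // bottom 10 bits
  return [high, low];
}

// U+1D7E1 is the double-struck digit nine from the question's input.
var pair = toSurrogatePair(0x1D7E1);
var rebuilt = String.fromCharCode(pair[0], pair[1]);

console.log(pair.map(function (u) { return u.toString(16); })); // [ 'd835', 'dfe1' ]
console.log(rebuilt.length);                                     // 2
console.log(rebuilt.charCodeAt(0) === pair[0]);                  // true
```

Reading either half on its own, as charAt does, yields a lone surrogate that has no glyph of its own.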

If you can use ES6, iterating over a string with the spread operator or for...of takes surrogate pairs into account and gives you correct results more easily. Other answers show how to do this.

If you can't use ES6, MDN has an example of how to handle surrogate pairs with charAt. I'll use that code below.

function getWholeChar(str, i) {
  var code = str.charCodeAt(i);

  // charCodeAt returns NaN when i is out of range. The global isNaN is used
  // because Number.isNaN is itself an ES6 addition.
  if (isNaN(code)) return '';
  // Not a surrogate: the code unit is a whole character.
  if (code < 0xD800 || code > 0xDFFF) return str.charAt(i);

  // High surrogate: return it together with the low surrogate that follows.
  if (0xD800 <= code && code <= 0xDBFF) {
    if (str.length <= (i + 1)) throw 'High surrogate without following low surrogate';
    var next = str.charCodeAt(i + 1);
    if (0xDC00 > next || next > 0xDFFF) throw 'High surrogate without following low surrogate';
    return str.charAt(i) + str.charAt(i + 1);
  }

  // Low surrogate: it was already returned with the preceding high surrogate.
  if (i === 0) throw 'Low surrogate without preceding high surrogate';
  var prev = str.charCodeAt(i - 1);

  if (0xD800 > prev || prev > 0xDBFF) throw 'Low surrogate without preceding high surrogate';
  return false;
}

convert.onclick =
  function() {
    for (var i = 0, chr; i < before.value.length; i++) {
      if(!(chr = getWholeChar(before.value, i))) continue;
      after.value += "'" + chr + "', ";
    }
  }
<textarea id="before" type="text" name="input" style="width:100%;">*𝟡(𝟘)-_=+𝕢ℚ𝕨𝕎𝕖𝔼𝕣ℝ𝕥𝕋𝕪𝕐𝕦𝕌</textarea><br />
<textarea id="after" cols="50" rows="10" name="output" style="width:100%;"></textarea>
<button id="convert" name="convert" type="button">convert</button>

1

You can use the spread operator (...) to create an array of Unicode characters:

convert.onclick = function () {
	after.value = [...before.value].map(s => `'${s}'`).join(",");
};
<textarea id="before" type="text" name="input" style="width:100%;">*𝟡(𝟘)-_=+𝕢ℚ𝕨𝕎𝕖𝔼𝕣ℝ𝕥𝕋𝕪𝕐𝕦𝕌</textarea><br />
<textarea id="after" cols="50" rows="10" name="output" style="width:100%;"></textarea>
<button id="convert" name="convert" type="button">convert</button>
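
For use outside the page, the same idea can be wrapped in a small DOM-free function. This is only a sketch, and the name quoteCodePoints is made up for illustration. Spreading splits a string into code points, so it keeps surrogate pairs together, but it does not merge combining sequences into one element.

```javascript
// Hypothetical helper: quote each code point of a string.
// Spreading a string uses its iterator, which yields whole code points,
// so surrogate pairs stay together.
function quoteCodePoints(str) {
  return [...str].map(s => `'${s}'`).join(",");
}

const sample = "*𝟡(𝟘)";
console.log(sample.length);           // 7 (UTF-16 code units)
console.log([...sample].length);      // 5 (code points)
console.log(quoteCodePoints(sample)); // '*','𝟡','(','𝟘',')'
```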

0

That is because JavaScript strings are UTF-16: any character with a code point above U+FFFF is stored as two code units, so its length is 2.

console.log("9".length); // 1
console.log("𝟡".length); // 2

console.log("𝟡".charAt(0)); // lone high surrogate "\uD835", not a displayable character
console.log(String.fromCodePoint("𝟡".codePointAt(0))); // "𝟡"

To fix it, use codePointAt and String.fromCodePoint instead of charAt:

convert.onclick =
  function() {
    for (const char of before.value) {
      after.value += `'${String.fromCodePoint(char.codePointAt(0))}'`;
    }
  }
<textarea id="before" type="text" name="input" style="width:100%;">*𝟡(𝟘)-_=+𝕢ℚ𝕨𝕎𝕖𝔼𝕣ℝ𝕥𝕋𝕪𝕐𝕦𝕌</textarea><br />
<textarea id="after" cols="50" rows="10" name="output" style="width:100%;"></textarea>
<button id="convert" name="convert" type="button">convert</button>

You can also do an index-based traversal, but that requires increasing the index variable inside the loop by the length of the character just read:

convert.onclick =
  function() {
    for (let i = 0; i < before.value.length; ) {
      after.value += `'${String.fromCodePoint(before.value.codePointAt(i))}'`;
      i += String.fromCodePoint(before.value.codePointAt(i)).length;
    }
  }
<textarea id="before" type="text" name="input" style="width:100%;">*𝟡(𝟘)-_=+𝕢ℚ𝕨𝕎𝕖𝔼𝕣ℝ𝕥𝕋𝕪𝕐𝕦𝕌</textarea><br />
<textarea id="after" cols="50" rows="10" name="output" style="width:100%;"></textarea>
<button id="convert" name="convert" type="button">convert</button>
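
As a sketch of the index-based approach without the DOM: codePointAt(i) returns the full code point when i sits on a high surrogate, so the loop can advance by two code units for code points above U+FFFF and by one otherwise. The function name codePointsByIndex is hypothetical.

```javascript
// Hypothetical helper: collect the characters of a string with an
// index-based loop that steps over surrogate pairs.
function codePointsByIndex(str) {
  const result = [];
  for (let i = 0; i < str.length; ) {
    const cp = str.codePointAt(i);
    result.push(String.fromCodePoint(cp));
    // Code points above U+FFFF occupy two UTF-16 code units.
    i += cp > 0xFFFF ? 2 : 1;
  }
  return result;
}

const chars = codePointsByIndex("𝕢ℚ𝕨");
console.log(chars.length);    // 3
console.log(chars.join("|")); // 𝕢|ℚ|𝕨
```

Comparing the code point against 0xFFFF avoids building a temporary string just to read its length.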

4 Comments

Why use String.fromCodePoint(char.codePointAt(0)) if you already have a single Unicode character?
@Ponury Just for demonstration purposes, to make OP aware of these methods.
Using for...of to iterate over a string also makes codePointAt redundant, since for...of already accounts for surrogate pairs.
I do know that, but an index-based loop here would be a little tricky, since you would have to increase the index variable inside the loop by the character's length.
