A user can copy and paste into a textarea html input and sometimes is pasting invalid UTF-8 characters, for example, a copy and paste from a rtf file that contains tabs.

How can I check if a string is a valid UTF-8?

3 Answers

7

Exposition

I think you misunderstand what "UTF-8 characters" means: UTF-8 is an encoding of Unicode that can represent any character defined in the (ever-growing) Unicode standard. Not every sequence of bytes is valid UTF-8, so the only "invalid UTF-8 characters" are byte sequences that don't decode to any Unicode code point, but I assume this is not what you're referring to.

for example, a copy and paste from a rtf file that contains tabs.

RTF is a formatting system which works independently of the underlying encoding scheme: you can use RTF with ASCII, UTF-8, UTF-16 and other encodings. With respect to the HTML textboxes in your post, both the <input type="text"> and <textarea> elements in HTML only hold plain text, so any RTF formatting will be automatically stripped when pasted by a user. That is why JS-heavy "rich-edit" and contenteditable components are not uncommon in web-applications (though in this answer I assume you're not using a rich-edit component in a web-page).

Tabs in RTF files are not an RTF feature: they're just normal ASCII-style tab characters, i.e. \t or 0x09, which also appear in Unicode, and thus, can also appear in UTF-8 encoded text; furthermore, it's perfectly valid for web-browsers to allow users to paste those into <input> and <textarea>.
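
To make this concrete, here is a minimal sketch showing that a tab is an ordinary character that encodes to the single byte 0x09 in UTF-8. It uses the standard TextEncoder and TextDecoder APIs; the sample string is made up for illustration.

```javascript
// Hypothetical sample text, as it might arrive from a paste that contains a tab.
const pasted = "name\tvalue";

// TextEncoder always produces UTF-8 bytes.
const bytes = new TextEncoder().encode(pasted);

// The tab sits at index 4: code point U+0009, encoded as the single byte 0x09.
const tabCodePoint = pasted.codePointAt(4);
const tabByte = bytes[4];

// With fatal: true the decoder throws on invalid UTF-8; it does not throw here.
const roundTripped = new TextDecoder("utf-8", { fatal: true }).decode(bytes);

console.log(tabCodePoint, tabByte, roundTripped === pasted); // 9 9 true
```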


JavaScript (ECMAScript) itself is Unicode-native; that is, the ECMAScript specification does require JS engines to use UTF-16 representations in some places, such as in the abstract operation IsStringWellFormedUnicode:

7.2.9 Static Semantics: IsStringWellFormedUnicode

The abstract operation IsStringWellFormedUnicode takes argument string (a String) and returns a Boolean. It interprets string as a sequence of UTF-16 encoded code points, as described in 6.1.4, and determines whether it is a well formed UTF-16 sequence.

...but that part of the specification is intended for JS engine programmers, and not people who write JS for use in browsers. In fact, I'd say it's safe to assume that within a web-browser, any-and-all JS string values will always be valid strings that can always be serialized out to UTF-8 and UTF-16, and also that JS scripts should not be concerned with the actual in-memory encoding of the string's content.
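
For illustration only: what that well-formedness check guards against is an unpaired UTF-16 surrogate. Such a string can be built deliberately in script, but I would not expect ordinary pasted text to contain one. The sketch below uses a helper name of my own choosing (hasLoneSurrogate) and relies on the u flag making the regular expression match whole code points.

```javascript
// Hypothetical helper: true if str contains an unpaired UTF-16 surrogate.
// With the u flag the regex matches by code point, so a valid surrogate pair
// is seen as one code point outside the range and does not match.
function hasLoneSurrogate(str) {
  return /[\uD800-\uDFFF]/u.test(str);
}

console.log(hasLoneSurrogate("tab\there")); // false
console.log(hasLoneSurrogate("𠮷"));         // false
console.log(hasLoneSurrogate("abc\uD842")); // true

// Encoding to UTF-8 replaces the lone surrogate with U+FFFD,
// so the output bytes are still valid UTF-8.
const roundTripped = new TextDecoder().decode(new TextEncoder().encode("abc\uD842"));
console.log(roundTripped === "abc\uFFFD"); // true
```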

Your question

So given that your question is written as this:

A user can copy and paste into a textarea html input and sometimes is pasting invalid UTF-8 characters, for example, a copy and paste from a rtf file that contains tabs.

How can I check if a string is a valid UTF-8?

I'm going to interpret it as this:

A user can copy RTF text from a program like WordPad and paste it into an HTML <textarea> or <input type="text"> in a web-browser, and when it's pasted the plaintext representation of the RTF still contains certain characters that my application should not accept, such as whitespace like tabs.

How can I detect these unwanted characters and inform the user - or remove those unwanted characters?

...to which my answer is:

I suggest just stripping out unwanted characters using a regular expression that matches non-visible characters (taken from the question "Match non printable/non ascii characters and remove from text"):

let textBoxContent = document.getElementById( 'myTextarea' ).value;
textBoxContent = textBoxContent.replace( /[^\x20-\x7E]+/g, '' );

  • The expression [^\x20-\x7E] matches any character NOT in the codepoint range 0x20 (32, a normal space character ' ') to 0x7E (126, the tilde '~' character). All other characters will be removed, including tabs, line breaks, and non-Latin text.

  • The g switch at the end makes it a global find-and-replace operation; without the g then only the first unwanted character would be removed.

  • The range 0x20-0x7E works because Unicode's first 128 codepoints (0 to 127) are identical to ASCII and can be seen here: http://www.asciitable.com/
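
If you would rather detect the unwanted characters and inform the user than silently remove them, the same character class can be used with matchAll. This is only a sketch: findUnwantedChars and the shape of the objects it returns are names I made up for illustration.

```javascript
// Hypothetical helper: list every character outside printable ASCII (0x20-0x7E).
// The u flag makes each match a whole code point rather than half of a surrogate pair.
function findUnwantedChars(text) {
  const found = [];
  for (const match of text.matchAll(/[^\x20-\x7E]/gu)) {
    const codePoint = match[0].codePointAt(0);
    found.push({
      index: match.index, // position in UTF-16 code units
      codePoint: "U+" + codePoint.toString(16).toUpperCase().padStart(4, "0"),
    });
  }
  return found;
}

const problems = findUnwantedChars("a\tb\u00E9");
console.log(problems.length);       // 2
console.log(problems[0].codePoint); // U+0009
console.log(problems[1].codePoint); // U+00E9
```

In a <textarea> you may want to keep line breaks; adding \r and \n to the class, i.e. /[^\x20-\x7E\r\n]/gu, would be one way to allow them.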


4 Comments

To correct some misconceptions in this answer, too: there is no such thing as UTF8 "characters"; as an encoding scheme there are "UTF8 byte sequences", encoding Unicode code points, and these byte sequences can absolutely suffer from illegal values in the byte sequence. Similarly, Unicode as the formal mapping of "orthographic constructs" to numerical codes also has certain numbers that may not be used. Encountering a UTF8 byte stream with an illegal byte sequence, or a decoded Unicode sequence containing illegal numbers, is entirely possible, so: yes, there are "invalid UTF-8 characters".
@Mike'Pomax'Kamermans I've rewritten my answer to implement your feedback; thank you for the input.
I've further edited your text because that's not a technical detail if the whole point of the paragraph is to explain that the answer is "yes", but that the question the answer is "yes" to isn't what they wanted to know.
@Mike'Pomax'Kamermans 👍
2

Just an idea:

function checkUTF8(text) {
    var utf8Text = text;
    try {
        // Try to convert to utf-8
        utf8Text = decodeURIComponent(escape(text));
        // If the conversion succeeds, text is not utf-8
    }catch(e) {
        // console.log(e.message); // URI malformed
        // This exception means text is utf-8
    }   
    return utf8Text; // returned text is always utf-8
}
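
A rough equivalent of the same idea without escape might look like the sketch below. It assumes that text is a "byte string" in which each character stands for one byte; decodeByteString is a made-up name, and ignoreBOM: true is there so that a leading byte order mark is kept in the result rather than stripped.

```javascript
// Hypothetical helper: treat each character of text as one byte and try to
// decode those bytes as UTF-8. Returns the decoded string, or text unchanged
// when it is not a byte string or the bytes are not valid UTF-8.
function decodeByteString(text) {
  const bytes = new Uint8Array(text.length);
  for (let i = 0; i < text.length; i++) {
    const code = text.charCodeAt(i);
    if (code > 0xff) return text; // not a byte string, nothing to decode
    bytes[i] = code;
  }
  try {
    // fatal: true makes decode() throw instead of inserting U+FFFD.
    return new TextDecoder("utf-8", { fatal: true, ignoreBOM: true }).decode(bytes);
  } catch {
    return text;
  }
}

console.log(decodeByteString("\xC3\xA9") === "\u00E9"); // true, two bytes decoded to one character
console.log(decodeByteString("\u00E9") === "\u00E9");   // true, returned unchanged
```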

2 Comments

escape is deprecated and should not be used (because it can't handle Unicode properly)
What does "text is not utf-8" mean? It seems this means text is ASCII? and in the catch it is unicode?
2

Indeed, it is possible to create invalid UTF-8 strings in JavaScript if you are e.g. parsing from raw buffers:

// in Node.js
Buffer.from([240, 160, 174, 183]).toString("utf8") // '𠮷'
Buffer.from([240, 160, 174]).toString("utf8") // '�'

// in browser or Node.js
new TextDecoder().decode(new Uint8Array([240, 160, 174, 183])) // '𠮷'
new TextDecoder().decode(new Uint8Array([240, 160, 174])) // '�'

To be clear, the string isn't really invalid anymore -- JavaScript has converted it to a valid string by replacing the invalid bytes with the replacement character, �.

So one option is to check for '�' in your string.
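
A minimal sketch of that check follows; containsReplacementCharacter is a name made up for illustration. Note that it cannot tell a replaced invalid byte apart from text that genuinely contained U+FFFD to begin with.

```javascript
// U+FFFD is the Unicode replacement character that lenient decoders insert.
const REPLACEMENT_CHARACTER = "\uFFFD";

// Hypothetical helper: true if the decoded text contains U+FFFD.
function containsReplacementCharacter(text) {
  return text.includes(REPLACEMENT_CHARACTER);
}

const good = new TextDecoder().decode(new Uint8Array([240, 160, 174, 183]));
const bad = new TextDecoder().decode(new Uint8Array([240, 160, 174]));

console.log(containsReplacementCharacter(good)); // false
console.log(containsReplacementCharacter(bad));  // true
```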

But if you want to check if your original sequence of bytes is valid UTF-8 without this automatic conversion, you can decode it with a TextDecoder with the fatal option set to true:

function isValidUtf8(bytes) {
  let decoder = new TextDecoder("utf8", { fatal: true });
  try {
    decoder.decode(bytes);
  } catch {
    return false;
  }
  return true;
}

isValidUtf8(new Uint8Array([240, 160, 174, 183])); // true
isValidUtf8(new Uint8Array([240, 160, 174])); // false
