Is there a simple way to check whether a string is a valid UTF-8 sequence in JavaScript?

I really do not want to end up with a regular expression like this:

Regex to detect invalid UTF-8 string

P.S.: I am receiving data from an external API, and sometimes (very rarely, but it happens) it returns data with invalid UTF-8 sequences. Trying to insert that data into PostgreSQL results in an appropriate error.

  • I don't think that really makes any sense. A string is a list of characters. UTF-8 is a way of representing characters in a binary format. A string in itself does not have an encoding. Commented Dec 17, 2013 at 16:12
  • unless you are trying to determine if a string can be represented completely using utf-8 encoding ? Commented Dec 17, 2013 at 16:12
  • the only way to check for a valid UTF8 is to check whether or not it contains invalid utf8 chars. The regex you linked is an effective, concise and efficient way to perform the check. You can, of course, check against your own dictionary, in a custom tuned way. Commented Dec 17, 2013 at 16:13
  • I don't know of any built-in method so last time I needed this, I used text.match(/[\x80-\xFF]+/) to gather potential problems, and checked each match against the UTF-8 specification -- 52 lines of code. Using that regexp is actually a pretty neat, fast, and simple way. Commented Dec 17, 2013 at 16:14
  • or you are trying to figure out if a sequence of bytes can be interpreted as an utf-8 encoded string? Commented Dec 17, 2013 at 16:15

1 Answer

UTF-8 is in fact a simple encoding, but what you are asking still can't be done with a one-liner. You have to:

  1. Override the Content-Type of the response so that your script receives a byte array, and prevent the browser/library from interpreting the response itself.
  2. Loop over the bytes to build characters. Note that UTF-8 is a variable-length encoding, which is why some byte sequences are invalid.
  3. If an invalid octet is found, skip it.
  4. If needed, deserialize the JSON/XML/whatever string into a JavaScript object, handling possible failures.
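
For illustration, here is a rough sketch of those four steps. It assumes a runtime that provides `fetch` and the WHATWG `TextDecoder` (modern browsers, recent Node.js), which is not necessarily what you have. The helper names `fetchBytes`, `decodeLossy` and `parseJsonOrNull` are made up for this example. Note also that `TextDecoder` in its default mode replaces each invalid sequence with U+FFFD rather than dropping it, so step 3 is only approximated here.

```javascript
// Step 1 (assumes a fetch-capable runtime): read the body as raw bytes
// so nothing interprets the response as text before we do.
async function fetchBytes(url) {
  const response = await fetch(url);
  return new Uint8Array(await response.arrayBuffer());
}

// Steps 2-3: decode the bytes as UTF-8. In the default (non-fatal) mode,
// invalid sequences become U+FFFD instead of raising an error.
function decodeLossy(bytes) {
  return new TextDecoder("utf-8").decode(bytes);
}

// Step 4: deserialize, handling failures instead of letting them propagate.
function parseJsonOrNull(text) {
  try {
    return JSON.parse(text);
  } catch (err) {
    return null; // the caller decides what to do with unparseable input
  }
}

// Usage with a body that contains one invalid octet (0xFF):
const sample = Uint8Array.from([0x7b, 0x22, 0x61, 0x22, 0x3a, 0x22, 0xff, 0x22, 0x7d]);
const parsed = parseJsonOrNull(decodeLossy(sample));
```

The resulting string is valid Unicode, so storing it should no longer trip an encoding check, at the cost of replacement characters wherever the source was broken.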

Deciding whether a given byte array is a valid UTF-8 sequence is quite a straightforward task (just a bunch of if statements and bit shifts), but again, it's not a one-line thing.
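
To give an idea of what "a bunch of if statements and bit shifts" looks like, here is a minimal sketch of such a check. The function name `isValidUtf8` is my own, and it assumes the input is a `Uint8Array` or a plain array of integers in the range 0-255. It rejects truncated sequences, stray continuation bytes, overlong forms, surrogates, and code points above U+10FFFF.

```javascript
// Returns true if `bytes` is a well-formed UTF-8 byte sequence.
function isValidUtf8(bytes) {
  let i = 0;
  while (i < bytes.length) {
    const b0 = bytes[i];
    let needed;    // number of continuation bytes that must follow
    let min;       // smallest code point allowed for this length (rejects overlong forms)
    let codePoint; // payload bits collected so far

    if (b0 <= 0x7f) {                  // 0xxxxxxx: plain ASCII
      i += 1;
      continue;
    } else if ((b0 & 0xe0) === 0xc0) { // 110xxxxx: 2-byte sequence
      needed = 1; min = 0x80; codePoint = b0 & 0x1f;
    } else if ((b0 & 0xf0) === 0xe0) { // 1110xxxx: 3-byte sequence
      needed = 2; min = 0x800; codePoint = b0 & 0x0f;
    } else if ((b0 & 0xf8) === 0xf0) { // 11110xxx: 4-byte sequence
      needed = 3; min = 0x10000; codePoint = b0 & 0x07;
    } else {
      return false;                    // stray continuation byte or 0xF8-0xFF
    }

    if (i + needed >= bytes.length) return false; // sequence is truncated

    for (let k = 1; k <= needed; k++) {
      const b = bytes[i + k];
      if ((b & 0xc0) !== 0x80) return false;      // must look like 10xxxxxx
      codePoint = (codePoint << 6) | (b & 0x3f);
    }

    if (codePoint < min) return false;            // overlong encoding
    if (codePoint > 0x10ffff) return false;       // beyond the Unicode range
    if (codePoint >= 0xd800 && codePoint <= 0xdfff) return false; // surrogate

    i += needed + 1;
  }
  return true;
}

isValidUtf8([0xe2, 0x82, 0xac]); // true  (the euro sign)
isValidUtf8([0xc3]);             // false (truncated 2-byte sequence)
```

If your environment has `TextDecoder`, constructing it with `{ fatal: true }` and catching the `TypeError` thrown by `decode()` should give the same answer with less code, but the loop above works anywhere.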
