
How do I detect the encoding of a string in Node.js and convert it into a valid Unicode string?

For example, how do I detect a CP437-encoded string and convert it into a valid Unicode string?

Input: ¨Quin ha enga¤ado

Output: ¿Quién ha engañado

I wish to dynamically detect the encoding type and convert the string into a valid Unicode string. Thanks in advance.

1 Answer

There's no such thing as a CP437-encoded string in [Node]JS. Strings are always Unicode (well, sequences of UTF-16 code units).

What you have in ¨Quin ha enga¤ado is a string that was decoded from bytes using the wrong encoding at some point in the past (aka mojibake). You need to find where that string came from, and change the encoding that was used to convert it from bytes.

It is sometimes possible to rescue a badly-decoded string by encoding it back to a Buffer using the same encoding that was wrongly used to decode it, and then decoding that Buffer again with the right encoding. But this only works when every byte involved happens to have a mapping in the wrongly-used code page, and the string has suffered no further damage.

It looks like you have a string that was decoded using ISO-8859-1, so in principle you could encode it back as ISO-8859-1 (e.g. Buffer.from(s, 'latin1'); the older new Buffer(s, 'binary') form is deprecated, and 'binary' is just an alias for 'latin1') and then decode the buffer as CP437 (unfortunately this encoding is not available in Node itself, so you need a third-party module such as iconv-lite).

However, your string has suffered further damage in that the é has completely disappeared. That could be because the misdecoded character for that byte is an invisible control character that StackOverflow doesn't allow to be posted, or it could be because that control character has been lost somewhere up the chain. If so, you cannot recover the original string at all.

I wish to dynamically detect the encoding type

There is no general way to automatically detect the encoding of a buffer, only vague heuristics (see the chardet module for an implementation of this). This is doubly difficult when you have mojibake, because you have to guess both the real encoding, and the wrongly-applied encoding.

You can burn a lot of time trying to detect common patterns, but ultimately you will never have a reliable solution. After all, ¨Quin ha enga¤ado is a perfectly valid sequence of characters already; how would your code know that wasn't what was meant?

Much better to fix the bug further up, where the bad decode actually happened.


4 Comments

Thanks for your suggestions. This information is actually crawled from the web, and there's no control over the source info since it's all from open websites.
When you're scraping, you want to determine/guess the encoding of the page at the point you download it. There are some examples in [this question](stackoverflow.com/questions/12326688/node-js-scrape-encoding) if you are using request.
If you know the language of the document, you can run an encoding conversion from a list of encodings to the same list (A→A, A→B, A→C, etc.) and then check that the resulting text does not contain any characters other than those allowed in the document's language.
Do you know any way to detect whether a Buffer instance is encoded in CP437? I can use iconv to decode it, but first I need to detect whether it's CP437 or not. I've checked two third-party modules: one detected the file as ASCII and the other as UTF-16, and the latter was fuzzy because it detected about five different encodings and none was CP437. In PHP there are functions to detect the encoding.
