I ran into a weird problem today while trying to decode a string that I assumed was UTF-8. It's being read from a stream as an array of strings (I'm using fast-csv), but it seems to be encoded somehow. As you can see from the console output below, if I log the value directly it shows the correct text, but when it's inside an object literal it shows the escaped version again.

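  // Assumes fast-csv 2.x, where the module export is itself callable.
  var fs = require('fs')
  var csv = require('fast-csv')

  var stream = fs
    .createReadStream(__dirname + '/my.csv')
    .pipe(csv({ ignoreEmpty: true }))
    .on('data', data => {
        // log the first field on its own
        console.log(data[0])
        // prints farren@rogers.com
        // log the same field inside an object literal
        console.log({ firstName: data[0] })
        // prints { firstName: '\u0000f\u0000a\u0000r\u0000r\u0000e\u0000n\u0000@\u0000r\u0000o\u0000g\u0000e\u0000r\u0000s\u0000.\u0000c\u0000o\u0000m\u0000' }
    })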

Any solution or explanations are appreciated.

Edit: even after decoding it with utf8.js and then putting it in the object literal, I still get the same problem.

3 Comments
  • Given the NUL characters interleaved in the output, it seems like the input might be UTF-16 being read as UTF-8. Ask the author of the CSV file which encoding they chose (or ask for an xlsx instead; they are much more self-describing). Commented Jul 3, 2018 at 17:16
  • @TomBlodget oh my god, I can't believe I didn't check that already. I thought it was UTF-8 all this time, since it was exported from Google Contacts. You can answer this question and I'll mark it as accepted. Thank you! Commented Jul 6, 2018 at 1:37
  • @TomBlodget btw, what do you mean by NUL characters interleaved, and how can I study this stuff on my own? Did you spot a special character that gave it away? Commented Jul 6, 2018 at 1:39

1 Answer

JavaScript uses UTF-16 for Strings. It also has a numeric escape notation for a UTF-16 code unit. So, when you see this output in your debugger

\u0000f\u0000a\u0000r\u0000r\u0000e\u0000n

It is saying that the String's code units are \u0000 f \u0000 a etc. The \uHHHH escape means the UTF-16 code unit HHHH in hexadecimal. \u0000 is the single (unpaired) UTF-16 code unit needed for the U+0000 (NUL) Unicode codepoint. So, something is being interpreted as NUL f NUL a, etc.
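To see those code units for yourself, here is a minimal sketch in Node.js. The names `garbled` and `codeUnits` are made up for illustration; the string is just the first few characters of the debugger output, typed with the same escapes.

```javascript
// The start of the string from the question, written with \uHHHH escapes.
const garbled = '\u0000f\u0000a\u0000r';

// List every UTF-16 code unit as four hex digits.
const codeUnits = Array.from(garbled, ch =>
  ch.charCodeAt(0).toString(16).padStart(4, '0'));

console.log(garbled.length);          // 6 code units, not 3
console.log(codeUnits.join(' '));     // 0000 0066 0000 0061 0000 0072
console.log(JSON.stringify(garbled)); // "\u0000f\u0000a\u0000r"
```

A terminal usually renders NUL as nothing at all, which would explain why logging the bare string looks fine while an escaped representation (an inspected object, or `JSON.stringify` here) shows every `\u0000`.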

UTF-8 code units are 8 bits each. NUL in UTF-8 is 0x00. f is 0x66.

UTF-16 code units are 16 bits each. NUL is 0x0000. f is 0x0066. When 16-bit values are stored as bytes, endianness applies. In little endian, 0x0066 is written as 0x66 0x00. In big endian, 0x00 0x66.
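The byte layouts can be checked with Node's `Buffer`. This is only a sketch: Node's `Buffer` encodings include `'utf16le'` but no big-endian counterpart, so the big-endian bytes are produced here by swapping each byte pair of a copy with `swap16()`.

```javascript
// 'f' as one UTF-8 code unit: a single byte.
const utf8 = Buffer.from('f', 'utf8');

// 'f' as one UTF-16 code unit stored little endian: low byte first.
const le = Buffer.from('f', 'utf16le');

// The same code unit stored big endian: copy the bytes, then swap each pair.
const be = Buffer.from(le).swap16();

console.log(utf8.toString('hex')); // 66
console.log(le.toString('hex'));   // 6600
console.log(be.toString('hex'));   // 0066
```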

So, if bytes of UTF-16 code units (such as the ones in the example data) are interpreted as UTF-8 (or perhaps other encodings), f can be read as NUL f or f NUL.
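The misreading is easy to reproduce. The sketch below encodes a sample string as UTF-16LE and then decodes the same bytes as UTF-8; the sample text and variable names are chosen for illustration and are not taken from the actual CSV file.

```javascript
const original = 'farren';

// Write the text as UTF-16LE bytes: 66 00 61 00 72 00 ...
const utf16leBytes = Buffer.from(original, 'utf16le');

// Read those bytes with the wrong encoding: every 0x00 byte becomes a NUL.
const misread = utf16leBytes.toString('utf8');
console.log(misread.length);          // 12
console.log(JSON.stringify(misread)); // "f\u0000a\u0000r\u0000r\u0000e\u0000n\u0000"

// Read them with the encoding they were written in: the text comes back intact.
const reread = utf16leBytes.toString('utf16le');
console.log(reread);                  // farren
```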

The fundamental rule of character encodings is to read with the same encoding that the text was written with. Not doing so can lead to data loss and corruption that can go undetected. Not knowing what the encoding is to begin with is data loss itself and a failed communication.
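When you can't ask the author, a byte order mark (BOM) at the start of the file is sometimes a usable hint. The helper below is a hypothetical sketch, not part of fast-csv or Node: `detectBomEncoding` is a name invented here, it only recognizes the UTF-8 and UTF-16 BOMs, and a file without a BOM tells it nothing.

```javascript
// Hypothetical helper: guess an encoding from the first bytes of a file.
// Returns 'utf8', 'utf16le', 'utf16be', or null when no known BOM is present.
// Caveat: a UTF-32LE BOM (FF FE 00 00) also starts with FF FE and is not handled.
function detectBomEncoding(bytes) {
  if (bytes.length >= 3 &&
      bytes[0] === 0xef && bytes[1] === 0xbb && bytes[2] === 0xbf) {
    return 'utf8';
  }
  if (bytes.length >= 2 && bytes[0] === 0xff && bytes[1] === 0xfe) {
    return 'utf16le';
  }
  if (bytes.length >= 2 && bytes[0] === 0xfe && bytes[1] === 0xff) {
    return 'utf16be';
  }
  return null;
}

// FF FE followed by 'f' in little endian.
console.log(detectBomEncoding(Buffer.from([0xff, 0xfe, 0x66, 0x00]))); // utf16le
// Plain ASCII bytes carry no BOM.
console.log(detectBomEncoding(Buffer.from('farren', 'utf8')));         // null
```

Once the encoding is known, pass it when reading, for example `fs.readFileSync(path, 'utf16le')`. A leading BOM survives decoding as the character U+FEFF, so you may want to strip it before parsing.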

You can learn more about Unicode at Unicode.org. You can learn more about languages and technologies that use it from their respective specifications; they are all very upfront and clear about it: JavaScript, Java, C#, VBA/VB4/VB5/VB6, VB.NET, F#, HTML, XML, T-SQL, and so on. (Okay, VB4 documentation might not be quite as clear, but the point is that this is very common and not new [VBPJ Sept. 1996], though we all are still struggling to assimilate it.)

1 Comment

I exported the CSV file via Google Contacts and had no idea it was UTF-16. After reading your answer I opened it in VSC instead of Vim and saw it's actually UTF-16, changed it to UTF-8, and everything started working again.
