decoding array of utf8 strings inside a stream

Question

I ran into a weird problem today while trying to decode a string that I assumed was UTF-8. The data arrives through a stream as an array of strings (I'm using fast-csv). As the console output below shows, logging a string directly prints the expected text, but logging it inside an object literal prints an escaped version full of \u0000 characters.

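  // Imports assumed from the description above (fast-csv's older, callable API)
  var fs = require('fs')
  var csv = require('fast-csv')

  var stream = fs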
    .createReadStream(__dirname + '/my.csv')
    .pipe(csv({ ignoreEmpty: true }))
    .on('data', data => {
        console.log(data[0])
        // prints farren@rogers.com
        console.log({ firstName: data[0] })
        // prints { firstName: '\u0000f\u0000a\u0000r\u0000r\u0000e\u0000n\u0000@\u0000r\u0000o\u0000g\u0000e\u0000r\u0000s\u0000.\u0000c\u0000o\u0000m\u0000' }
    })

Any solutions or explanations are appreciated.

Edit: even after decoding with utf8.js and then passing the result into the object literal, I still encounter the same problem.

Given the NUL characters interleaved in the output, it seems like the input might be UTF-16 being read as UTF-8. Ask the author of the CSV file which encoding they chose (or ask for an xlsx instead; they are much more self-describing).
– Tom Blodget, Commented Jul 3, 2018 at 17:16
@TomBlodget oh my god, I can't believe I didn't check that already. I thought it was UTF-8 all this time, since it was exported from Google Contacts. You can answer this question and I'll mark it as accepted. Thank you!
– Pouya Sanooei, Commented Jul 6, 2018 at 1:37
@TomBlodget btw, what do you mean by NUL characters interleaved, and how can I study this stuff on my own? Did you spot a special character that gave it away?
– Pouya Sanooei, Commented Jul 6, 2018 at 1:39

Tom Blodget · Accepted Answer · 2018-07-07 17:15:32Z

JavaScript uses UTF-16 for Strings. It also has a numeric escape notation for a UTF-16 code unit. So, when you see this output in your debugger:

\u0000f\u0000a\u0000r\u0000r\u0000e\u0000n

It is saying that the String's code units are \u0000 f \u0000 a etc. The \uHHHH escape means the UTF-16 code unit HHHH in hexadecimal. \u0000 is the single (unpaired) UTF-16 code unit needed for the U+0000 (NUL) Unicode code point. So, something is being interpreted as NUL f NUL a, etc.
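As a rough illustration of the paragraph above, the code units of a string can be listed directly. The helper name `codeUnits` is hypothetical and exists only for this sketch; it uses the standard `String.prototype.charCodeAt`.

```javascript
// Hypothetical helper: list the UTF-16 code units of a string as 4-digit hex.
function codeUnits(str) {
  const units = [];
  for (let i = 0; i < str.length; i++) {
    units.push(str.charCodeAt(i).toString(16).padStart(4, '0'));
  }
  return units;
}

// The first few characters of the garbled value from the question.
const garbled = '\u0000f\u0000a';

console.log(garbled.length);          // 4 (not 2)
console.log(codeUnits(garbled));      // [ '0000', '0066', '0000', '0061' ]
console.log(codeUnits('fa'));         // [ '0066', '0061' ]
console.log(JSON.stringify(garbled)); // "\u0000f\u0000a"
```

This likely also explains the difference seen in the question: printing the bare string sends the NUL characters to the terminal, where they are usually invisible, while printing an object shows an escaped representation of the same string.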

UTF-8 code units are 8 bits each. NUL in UTF-8 is 0x00. f is 0x66.

UTF-16 code units are 16 bits each. NUL is 0x0000. f is 0x0066. When 16-bit values are stored as bytes, endianness applies. In little endian, 0x0066 is written as 0x66 0x00. In big endian, 0x00 0x66.

So, if bytes of UTF-16 code units (such as the ones in the example data) are interpreted as UTF-8 (or perhaps other encodings), f can be read as NUL f or f NUL.
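The byte layouts described above can be reproduced with Node's `Buffer`. This is a small sketch, not the asker's code: the sample text `'fa'` is made up, and because Node has no big-endian UTF-16 encoding name, the big-endian bytes are produced here by byte-swapping a copy with `swap16()`.

```javascript
const text = 'fa';

// The same two characters, encoded two different ways.
const utf8Bytes = Buffer.from(text, 'utf8');
const utf16leBytes = Buffer.from(text, 'utf16le');

console.log(utf8Bytes);    // <Buffer 66 61>
console.log(utf16leBytes); // <Buffer 66 00 61 00>

// Reading UTF-16LE bytes as UTF-8 turns every 0x00 byte into a NUL character.
const misreadLE = utf16leBytes.toString('utf8');
console.log(JSON.stringify(misreadLE)); // "f\u0000a\u0000"

// Big endian: swap each pair of bytes in a copy of the little-endian buffer.
const utf16beBytes = Buffer.from(utf16leBytes).swap16();
console.log(utf16beBytes); // <Buffer 00 66 00 61>

const misreadBE = utf16beBytes.toString('utf8');
console.log(JSON.stringify(misreadBE)); // "\u0000f\u0000a"
```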

The fundamental rule of character encodings is to read with the same encoding that the text was written with. Not doing so can lead to data loss and corruption that can go undetected. Not knowing what the encoding is to begin with is data loss itself and a failed communication.
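A minimal sketch of applying that rule in Node, under some loud assumptions: the file starts with a byte order mark (BOM), only UTF-16LE is detected, and anything else falls back to a UTF-8 guess. The helper `sniffEncoding` and the demo file name are hypothetical. A file without a BOM cannot be identified this way, so asking the producer remains the reliable option.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical helper: guess the encoding from a BOM in the first two bytes.
// Only UTF-16LE (FF FE) is recognised; 'utf8' is a fallback guess, not a detection.
function sniffEncoding(firstBytes) {
  if (firstBytes.length >= 2 && firstBytes[0] === 0xff && firstBytes[1] === 0xfe) {
    return 'utf16le';
  }
  return 'utf8';
}

// Demo input: a tiny CSV written as UTF-16LE with a BOM (made-up file name).
const file = path.join(os.tmpdir(), 'utf16-demo.csv');
fs.writeFileSync(file, '\ufeffemail\nfarren@rogers.com\n', 'utf16le');

// Read with the encoding the file was written with, then drop the BOM if present.
const encoding = sniffEncoding(fs.readFileSync(file).subarray(0, 2));
const text = fs.readFileSync(file, encoding).replace(/^\ufeff/, '');

console.log(encoding);            // utf16le
console.log(text.split('\n')[1]); // farren@rogers.com
```

For a stream, the same encoding name can be passed as `fs.createReadStream(file, { encoding: 'utf16le' })` before piping into the CSV parser. Whether a particular fast-csv version handles the decoded string chunks as expected is an assumption worth checking against its documentation.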

You can learn more about Unicode at Unicode.org. You can learn more about languages and technologies that use it from their respective specifications—they are all very upfront and clear about it. JavaScript, Java, C#, VBA/VB4/VB5/VB6, VB.NET, F#, HTML, XML, T-SQL,…. (Okay, VB4 documentation might not be quite as clear but the point is that this is very common and not new [VBPJ Sept. 1996], though we all are still struggling to assimilate it.)

edited Jul 7, 2018 at 17:15

answered Jul 6, 2018 at 21:58

Tom Blodget

Pouya Sanooei Over a year ago

I exported the CSV file via Google Contacts and had no idea it was UTF-16. After reading your answer I opened it in VSC instead of Vim and saw that it's actually UTF-16. I changed it to UTF-8 and everything started working again.
