So, I'm trying to write a CSV-file importer using AngularJS on the frontend and NodeJS for the backend. My problem is that I'm not sure about the encoding of the incoming CSV files. Is there a way to automatically detect it?
I first tried to use FileReader.readAsDataURL() and do the detection in Node. But the file contents will be Base64-encoded, so I cannot do that (when I decode the file, I already need to know the encoding). If I use FileReader.readAsText(), I also need to know the encoding beforehand. I also cannot do it BEFORE initializing the FileReader, because the actual File object doesn't seem to include the file's contents.
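(As an aside: decoding the Base64 payload of a data URL in Node does not actually require knowing the character encoding; it just yields raw bytes. A minimal sketch, with a helper name of my own choosing:)

```javascript
// Hypothetical helper: turn a data URL into a raw byte Buffer.
// Base64 maps bytes <-> ASCII text, so no character encoding
// is needed for this step.
function dataUrlToBytes(dataUrl) {
  // Everything after the first comma is the Base64 payload
  const base64 = dataUrl.slice(dataUrl.indexOf(',') + 1);
  return Buffer.from(base64, 'base64');
}

// Example: "Hi!" as a data URL
const bytes = dataUrlToBytes('data:text/csv;base64,SGkh');
console.log(Array.from(bytes)); // [72, 105, 33] — the bytes for "H", "i", "!"
```

The character-encoding question only comes up afterwards, when you try to interpret those bytes as text.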
My current code:
generateFile = function (file) {
    const reader = new FileReader();
    reader.onload = function (evt) {
        if (checkSize(file.size) && isTypeValid(file.type)) {
            scope.$apply(function () {
                scope.file = evt.target.result;
                file.encoding = Encoding.detect(scope.file);
                if (angular.isString(scope.fileName)) {
                    // was the bare `name`, which is undefined in this scope
                    return scope.fileName = file.name;
                }
            });
            if (form) {
                form.$setDirty();
            }
            scope.fileArray.push({
                name: file.name,
                type: file.type,
                size: file.size,
                date: file.lastModified,
                encoding: file.encoding,
                file: scope.file
            });
            --scope.pending;
            if (scope.pending === 0) {
                scope.$emit('file-dropzone-drop-event', scope.fileArray);
                scope.fileArray = [];
            }
        }
    };
    // was /\.csv+$/i, which also matches ".csvvv"
    let fileExtExpression = /\.csv$/i;
    if (fileExtExpression.test(file.name)) {
        reader.readAsText(file);
    } else {
        reader.readAsDataURL(file);
    }
    ++scope.pending;
};
Is this just impossible to do, or what am I doing wrong? I even tried to solve this using FileReader.readAsArrayBuffer() and extracting the file header from there, but that was too complex for me and/or didn't seem to work.
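(For reference, the ArrayBuffer approach is less complex than it sounds if all you want is a byte-order-mark check. A rough sketch; pass in a Uint8Array, e.g. new Uint8Array(reader.result) from readAsArrayBuffer:)

```javascript
// Sketch: detect an encoding from a BOM in the first bytes of a file.
// Returns null when there is no BOM — most CSV files won't have one,
// so you still need a fallback heuristic or default.
function sniffBOM(bytes) {
  if (bytes.length >= 3 && bytes[0] === 0xEF && bytes[1] === 0xBB && bytes[2] === 0xBF) {
    return 'utf-8';
  }
  if (bytes.length >= 2 && bytes[0] === 0xFF && bytes[1] === 0xFE) {
    return 'utf-16le';
  }
  if (bytes.length >= 2 && bytes[0] === 0xFE && bytes[1] === 0xFF) {
    return 'utf-16be';
  }
  return null;
}

console.log(sniffBOM(new Uint8Array([0xEF, 0xBB, 0xBF, 0x61]))); // 'utf-8'
```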
Where is Encoding.detect coming from? Also, as far as I know, most text editors just "probe" the file for some typical encoded characters, guess the encoding from that, and then read it again with that encoding.

Base64 is a byte encoding, not a character encoding. It turns an array of bytes into a string. So when you decode it, you get back an array of bytes; you don't need to know any character encoding for that just yet. Now these bytes can represent a string, in which case they need further decoding, and for this step you will need to know the character encoding. Given a byte array you can make a few educated guesses; this works reasonably well with UTF encodings. But the problem is single-byte encodings, which are impossible to distinguish with certainty.
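One cheap probe along those lines: attempt a strict UTF-8 decode and fall back to a single-byte default when it fails. A sketch, assuming Node with a global TextDecoder (v11+), and assuming windows-1252 as the fallback — that choice is a guess, since single-byte encodings can't be told apart reliably:

```javascript
// Sketch: probe bytes for valid UTF-8. A fatal TextDecoder throws on
// any byte sequence that isn't well-formed UTF-8. This only answers
// "valid UTF-8" vs "not UTF-8"; it cannot identify which single-byte
// encoding a non-UTF-8 file uses.
function guessEncoding(bytes) {
  try {
    new TextDecoder('utf-8', { fatal: true }).decode(bytes);
    return 'utf-8';
  } catch (e) {
    return 'windows-1252'; // assumed single-byte fallback
  }
}

// 0xE4 starts a 3-byte UTF-8 sequence, but 0x6C is not a continuation
// byte, so this input is not valid UTF-8:
console.log(guessEncoding(new Uint8Array([0x68, 0xE4, 0x6C, 0x6C, 0x6F]))); // 'windows-1252'
```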