So, I'm trying to write a CSV-file importer using AngularJS on the frontend and NodeJS for the backend. My problem is that I'm not sure about the encoding of the incoming CSV files. Is there a way to automatically detect it?
I first tried to use FileReader.readAsDataURL() and do the detection in Node. But the file contents will be Base64-encoded, so I cannot do that (when I decode the file, I already need to know the encoding). If I use FileReader.readAsText(), I also need to know the encoding beforehand. I also cannot do it BEFORE initializing the FileReader, because the actual File object doesn't seem to include the file's contents.
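(As an aside: decoding the Base64 payload of a data URL in Node does not actually require knowing the character encoding; it just yields raw bytes. A minimal sketch, with a helper name of my own choosing:)

```javascript
// Hypothetical helper: turn a data URL into a raw byte Buffer.
// Base64 maps bytes <-> ASCII text, so no character encoding
// is needed for this step.
function dataUrlToBytes(dataUrl) {
  // Everything after the first comma is the Base64 payload
  const base64 = dataUrl.slice(dataUrl.indexOf(',') + 1);
  return Buffer.from(base64, 'base64');
}

// Example: "Hi!" as a data URL
const bytes = dataUrlToBytes('data:text/csv;base64,SGkh');
console.log(Array.from(bytes)); // [72, 105, 33] — the bytes for "H", "i", "!"
```

The character-encoding question only comes up afterwards, when you try to interpret those bytes as text.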
My current code:
generateFile = function (file) {
    const reader = new FileReader();
    reader.onload = function (evt) {
        if (checkSize(file.size) && isTypeValid(file.type)) {
            scope.$apply(function () {
                scope.file = evt.target.result;
                file.encoding = Encoding.detect(scope.file);
                if (angular.isString(scope.fileName)) {
                    // was the bare `name`, which is undefined in this scope
                    return scope.fileName = file.name;
                }
            });
            if (form) {
                form.$setDirty();
            }
            scope.fileArray.push({
                name: file.name,
                type: file.type,
                size: file.size,
                date: file.lastModified,
                encoding: file.encoding,
                file: scope.file
            });
            --scope.pending;
            if (scope.pending === 0) {
                scope.$emit('file-dropzone-drop-event', scope.fileArray);
                scope.fileArray = [];
            }
        }
    };
    // was /\.csv+$/i, which also matches ".csvvv"
    let fileExtExpression = /\.csv$/i;
    if (fileExtExpression.test(file.name)) {
        reader.readAsText(file);
    } else {
        reader.readAsDataURL(file);
    }
    ++scope.pending;
};
Is this just impossible to do, or what am I doing wrong? I even tried to solve this using FileReader.readAsArrayBuffer() and extracting the file header from there, but that was too complex for me and/or didn't seem to work.
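(For reference, the ArrayBuffer approach is less complex than it sounds if all you want is a byte-order-mark check. A rough sketch; pass in a Uint8Array, e.g. new Uint8Array(reader.result) from readAsArrayBuffer:)

```javascript
// Sketch: detect an encoding from a BOM in the first bytes of a file.
// Returns null when there is no BOM — most CSV files won't have one,
// so you still need a fallback heuristic or default.
function sniffBOM(bytes) {
  if (bytes.length >= 3 && bytes[0] === 0xEF && bytes[1] === 0xBB && bytes[2] === 0xBF) {
    return 'utf-8';
  }
  if (bytes.length >= 2 && bytes[0] === 0xFF && bytes[1] === 0xFE) {
    return 'utf-16le';
  }
  if (bytes.length >= 2 && bytes[0] === 0xFE && bytes[1] === 0xFF) {
    return 'utf-16be';
  }
  return null;
}

console.log(sniffBOM(new Uint8Array([0xEF, 0xBB, 0xBF, 0x61]))); // 'utf-8'
```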
Where is Encoding.detect coming from? Also, as far as I know, most text editors just "probe" the file for some typical encoded characters, guess the encoding from that, and then read it again with that encoding.

Base64 is a byte encoding, not a character encoding. It turns an array of bytes into a string. So when you decode it, you get back an array of bytes; you don't need to know any character encoding for that just yet. Now these bytes can represent a string, in which case they need further decoding, and for this step you will need to know the character encoding. Given a byte array you can make a few educated guesses; this works reasonably well with UTF encodings. But the problem is single-byte encodings, which are impossible to distinguish with certainty.
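One cheap probe along those lines: attempt a strict UTF-8 decode and fall back to a single-byte default when it fails. A sketch, assuming Node with a global TextDecoder (v11+), and assuming windows-1252 as the fallback — that choice is a guess, since single-byte encodings can't be told apart reliably:

```javascript
// Sketch: probe bytes for valid UTF-8. A fatal TextDecoder throws on
// any byte sequence that isn't well-formed UTF-8. This only answers
// "valid UTF-8" vs "not UTF-8"; it cannot identify which single-byte
// encoding a non-UTF-8 file uses.
function guessEncoding(bytes) {
  try {
    new TextDecoder('utf-8', { fatal: true }).decode(bytes);
    return 'utf-8';
  } catch (e) {
    return 'windows-1252'; // assumed single-byte fallback
  }
}

// 0xE4 starts a 3-byte UTF-8 sequence, but 0x6C is not a continuation
// byte, so this input is not valid UTF-8:
console.log(guessEncoding(new Uint8Array([0x68, 0xE4, 0x6C, 0x6C, 0x6F]))); // 'windows-1252'
```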