Here's my case: I'm working on a very large project that contains many files. Some of these files are encoded in UTF-8, others in ANSI. We need to convert all the files to UTF-8, because we decided this will be the default for our future projects. This matters a lot because we're Brazilian and our everyday words use characters like á, ç, ê, ü, etc., so having files in different character encodings has caused serious problems.

Anyway, I found this JS script, which converts ANSI files to UTF-8, copying them to another folder and preserving the originals:

var indir = "in";   // folder with the original files
var outdir = "out"; // folder that receives the converted copies

// Reads fin as ANSI text and writes the same text to fout as UTF-8.
function ansiToUtf8(fin, fout) {
    var ansi = WScript.CreateObject("ADODB.Stream");
    ansi.Open();
    ansi.Charset = "x-ansi";
    ansi.LoadFromFile(fin);
    var utf8 = WScript.CreateObject("ADODB.Stream");
    utf8.Open();
    utf8.Charset = "UTF-8";
    utf8.WriteText(ansi.ReadText());
    utf8.SaveToFile(fout, 2 /*adSaveCreateOverWrite*/);
    ansi.Close();
    utf8.Close();
}

// Convert every file directly inside indir (subfolders are not visited).
var fso = WScript.CreateObject("Scripting.FileSystemObject");
var folder = fso.GetFolder(indir);
var fc = new Enumerator(folder.files);
for (; !fc.atEnd(); fc.moveNext()) {
    var file = fc.item();
    ansiToUtf8(indir + "\\" + file.name, outdir + "\\" + file.name);
}

which I run from the command line with:

cscript /Nologo ansi2utf8.js

The problem is that this script processes every file, including the ones that are already UTF-8, and that breaks my special characters. So I need to check whether a file is already encoded in UTF-8 and run my code only if it is ANSI. How can I do that?

Also, my script only processes the 'in' folder itself. I'm still looking for an easy way to make it descend into the subfolders and convert the files there too.

  • What environment are you doing this on? My first thought is that JS is probably not the right tool for the job here. Commented May 20, 2011 at 15:32
  • I'm using Windows 7 and I code in PHP/JavaScript. I don't know whether this can be done in another programming language, but that isn't really an option for me, because I probably wouldn't know how to do it. Commented May 20, 2011 at 15:56
  • If you do PHP, possibly consider the mbstring library: php.net/manual/en/book.mbstring.php Commented May 20, 2011 at 20:52

1 Answer

Do your UTF-8 files have a byte order mark? If so, you could simply check whether the first 3 bytes are EF BB BF to determine whether a file is UTF-8. Otherwise, the standard method is to check whether the file is legal UTF-8 all the way through; if it is, it is most likely meant to be read as UTF-8.
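Both checks can be sketched in plain JavaScript that avoids newer syntax, so it should also run under cscript. This is only a minimal sketch, not a definitive implementation: the function names hasUtf8Bom and isValidUtf8 are made up for illustration, and the code assumes the file's contents are already available as an array of byte values (0 to 255).

```javascript
// Hypothetical helpers; the names are illustrative, not from any library.

// Returns true if the byte array starts with the UTF-8 BOM (EF BB BF).
function hasUtf8Bom(bytes) {
    return bytes.length >= 3 &&
        bytes[0] === 0xEF && bytes[1] === 0xBB && bytes[2] === 0xBF;
}

// Returns true if the whole byte array is well-formed UTF-8.
function isValidUtf8(bytes) {
    var i = 0, n = bytes.length;
    while (i < n) {
        var b = bytes[i];
        var extra, min;
        if (b <= 0x7F) { i++; continue; }                 // plain ASCII byte
        else if (b >= 0xC2 && b <= 0xDF) { extra = 1; min = 0x80; }
        else if (b >= 0xE0 && b <= 0xEF) { extra = 2; min = 0x800; }
        else if (b >= 0xF0 && b <= 0xF4) { extra = 3; min = 0x10000; }
        else return false;                                // not a legal lead byte

        if (i + extra >= n) return false;                 // sequence is cut off

        // Keep the payload bits of the lead byte, then add 6 bits per
        // continuation byte, each of which must look like 10xxxxxx.
        var cp = b & (0xFF >> (extra + 2));
        for (var k = 1; k <= extra; k++) {
            var c = bytes[i + k];
            if ((c & 0xC0) !== 0x80) return false;
            cp = (cp << 6) | (c & 0x3F);
        }

        if (cp < min) return false;                       // overlong encoding
        if (cp > 0x10FFFF) return false;                  // beyond Unicode range
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;   // surrogate code point
        i += extra + 1;
    }
    return true;
}
```

A conversion script could then skip any file for which hasUtf8Bom or isValidUtf8 returns true and convert only the rest. A pure ASCII file also passes the check, which is harmless because its bytes are the same in both encodings. An ANSI file containing accented letters almost always fails, since a byte such as E1 (á in Windows-1252) is rarely followed by valid continuation bytes. How to get the raw bytes under Windows Script Host is left open here; one approach that is sometimes used is loading the file through ADODB.Stream with a single-byte charset and calling charCodeAt on each character, but verify that mapping on your own machine before relying on it.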

1 Comment

"the standard method is to check if the file is legal UTF-8 all the way through" Any tips on how to do this...?
