Parse UTF-8 XML in JavaScript

Question

I'm trying to load and parse a simple UTF-8-encoded XML file in JavaScript using Node and the xpath and xmldom packages. No XML namespaces are used, and the same XML parses correctly when converted to ASCII. In the VS Code debugger I can see that the string has what look like spaces embedded between each character (surely from loading the UTF-8 file incorrectly), but I can't find a way to properly load and parse the UTF-8 file.

Code:

var xpath = require('xpath')
  , dom = require('xmldom').DOMParser;

const fs = require('fs');

var myXml = "path_to_my_file.xml";

var xmlContents = fs.readFileSync(myXml, 'utf8').toString();

// this line causes errors parsing every single tag as the tag names have spaces in them from improper utf-8 decoding
var doc = new dom().parseFromString(xmlContents, 'application/xml');
var cvNode = xpath.select1("//MyTag", doc);

console.log(cvNode.textContent);

The code works fine if the file is ASCII (textContent has the proper data), but if it is UTF-8 then there are a number of parsing errors and cvNode is undefined.

Is there a proper way to parse UTF-8 XML in node/javascript? I can't for the life of me find a decent example.

Have you tried 'utf8' without the minus? That is the correct value to use for utf-8 encoding in this API. On the other hand, when you see additional white spaces between each letter this suggests that the file isn't actually encoded using utf-8 but uses an encoding with 16 bits base. Have you tried 'utf16le'?
– NineBerry, Commented Nov 19, 2019 at 18:29
@NineBerry 'utf16le' did the trick. Thanks so much. If you want to add an official answer I will mark it as such.
– Mike Marshall, Commented Nov 19, 2019 at 18:35

NineBerry · Accepted Answer · 2019-11-19 18:37:01Z


When you see additional whitespace between each letter, this suggests that the file isn't actually encoded as UTF-8 but uses a 16-bit Unicode encoding such as UTF-16.
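
To illustrate the symptom, here is a minimal sketch that uses only Node's built-in Buffer. The sample string '<MyTag>' is taken from the question; the variable names are made up for illustration. It encodes the tag as UTF-16LE and then decodes the same bytes both ways, which shows that the "spaces" are most likely NUL characters (U+0000) left over from the second byte of each 16-bit unit.

```javascript
// Encode a short tag as UTF-16LE, which is presumably how the file on disk was saved.
const bytes = Buffer.from('<MyTag>', 'utf16le');

// Decoding those bytes as UTF-8 leaves a NUL (U+0000) after every character.
// A debugger may render each NUL as a blank, which looks like a space.
const wrong = bytes.toString('utf8');

// Decoding with the matching encoding gives back the original text.
const right = bytes.toString('utf16le');

console.log(JSON.stringify(wrong)); // the NULs show up as \u0000 escapes
console.log(wrong.length);          // twice the length of the original tag
console.log(right);
```

An XML parser that receives the wrongly decoded string sees a NUL inside every tag name, which would explain an error on every single tag.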

Try passing 'utf16le' instead of 'utf8' as the encoding argument to fs.readFileSync.

For a list of supported encodings see Buffers and Character Encodings.
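
If the encoding of the input file is not known in advance, one option is to read the file as a raw Buffer and pick the decoder from the byte order mark (BOM). The sketch below is one way to do that, not part of the original answer: decodeWithBom is a hypothetical helper name, and it assumes the file starts with a BOM when it is UTF-16 and falls back to UTF-8 otherwise.

```javascript
// Hypothetical helper (not from the original answer): decode a Buffer to a
// string by inspecting its byte order mark (BOM).
function decodeWithBom(buf) {
  if (buf.length >= 2 && buf[0] === 0xff && buf[1] === 0xfe) {
    // FF FE: UTF-16 little-endian. Skip the 2-byte BOM.
    return buf.toString('utf16le', 2);
  }
  if (buf.length >= 2 && buf[0] === 0xfe && buf[1] === 0xff) {
    // FE FF: UTF-16 big-endian. Node has no 'utf16be' decoder, so copy the
    // body, swap each byte pair in place, and decode as little-endian.
    // Assumes an even number of bytes; swap16() throws otherwise.
    const body = Buffer.from(buf.subarray(2));
    return body.swap16().toString('utf16le');
  }
  if (buf.length >= 3 && buf[0] === 0xef && buf[1] === 0xbb && buf[2] === 0xbf) {
    // EF BB BF: UTF-8 with BOM. Skip the 3-byte BOM.
    return buf.toString('utf8', 3);
  }
  // Assumption: no BOM means UTF-8 (which also covers plain ASCII).
  return buf.toString('utf8');
}

// Small in-memory demo: a UTF-16LE BOM followed by UTF-16LE text.
const sample = Buffer.concat([
  Buffer.from([0xff, 0xfe]),
  Buffer.from('<MyTag>hi</MyTag>', 'utf16le'),
]);
console.log(decodeWithBom(sample));

// With the question's code, the read would become something like:
//   var xmlContents = decodeWithBom(fs.readFileSync(myXml)); // no encoding argument
//   var doc = new dom().parseFromString(xmlContents, 'application/xml');
```

Calling fs.readFileSync without an encoding returns a Buffer, which is what lets the helper look at the first bytes before choosing how to decode. A UTF-16 file without a BOM would still be misread by this sketch.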

