I have an XML string encoded in Big5:

atob('PD94bWwgdmVyc2lvbj0iMS4wIiBlbmNvZGluZz0iYmlnNSIgPz48dGl0bGU+pKSk5TwvdGl0bGU+')

(<?xml version="1.0" encoding="big5" ?><title>中文</title> in UTF-8.)

I'd like to extract the content of <title>. How can I do that with pure JavaScript in browsers? I'd prefer a lightweight solution without jQuery or Emscripten.

I have tried DOMParser:

(new DOMParser()).parseFromString(atob('PD94bWwgdmVyc2lvbj0iMS4wIiBlbmNvZGluZz0iYmlnNSIgPz48dGl0bGU+pKSk5TwvdGl0bGU+'), 'text/xml')

But neither Chromium nor Firefox respects the encoding attribute. Is it standard behavior that DOMParser supports UTF-8 only?

  • Maybe a silly question that exposes my ignorance, but how are you checking that the encoding attribute is not respected? Commented Jul 20, 2016 at 18:44
  • Also, in your real case, is the string encoded as big5, and then base64, as in your example here? Commented Jul 20, 2016 at 20:04
  • As a reference for future visitors, the real code is here: github.com/yan12125/chrome_newtab/blob/…. This is an old commit of my project, which now uses TextEncoder mentioned below. Commented Jul 26, 2016 at 3:39

2 Answers

I suspect the issue isn't DOMParser but atob, which can't properly decode base64 data that was originally a non-ASCII string.*

You will need another method to get at the original bytes, such as https://github.com/danguer/blog-examples/blob/master/js/base64-binary.js

var encoded = 'PD94bWwgdmVyc2lvbj0iMS4wIiBlbmNvZGluZz0iYmlnNSIgPz48dGl0bGU+pKSk5TwvdGl0bGU+';
var bytes = Base64Binary.decode(encoded);

You then need some way to convert the bytes (i.e. decode the Big5 data) into a JavaScript string. In Firefox and Chrome, you can use TextDecoder:

var decoder = new TextDecoder('big5');
var decoded = decoder.decode(bytes);

Then pass the result to DOMParser:

var dom = (new DOMParser()).parseFromString(decoded, 'text/xml');
var title = dom.children[0].textContent;

You can see this at https://plnkr.co/edit/TBspXlF2vNbNaKq8UxhW?p=preview


*One way of understanding why: atob doesn't take the encoding of the original string as a parameter. It decodes the base64 data to bytes, but it has no way of knowing which character encoding those bytes are in, so it returns a "binary string" in which each character holds one byte value (0 to 255). The Big5 bytes are therefore never decoded as Big5. JavaScript strings themselves are, I believe, internally encoded as UTF-16.
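
To make the footnote concrete, here is a minimal sketch that gets at the raw bytes without an external library. It relies on atob returning one character per byte. The names base64ToBytes and decodeBase64Text are hypothetical helpers, not part of any API, and the 'big5' label assumes a browser whose TextDecoder supports it (Firefox and Chrome, as noted above).

```javascript
// Hypothetical helper: turn a base64 string into the bytes it encodes.
// atob returns a "binary string" whose char codes are the byte values.
function base64ToBytes(b64) {
  return Uint8Array.from(atob(b64), (ch) => ch.charCodeAt(0));
}

// Hypothetical helper: decode base64-wrapped text whose encoding is known.
function decodeBase64Text(b64, label) {
  return new TextDecoder(label).decode(base64ToBytes(b64));
}

const encoded = 'PD94bWwgdmVyc2lvbj0iMS4wIiBlbmNvZGluZz0iYmlnNSIgPz48dGl0bGU+pKSk5TwvdGl0bGU+';
const bytes = base64ToBytes(encoded);

// Bytes 45..48 hold the title in Big5: 0xA4 0xA4 and 0xA4 0xE5.
const titleBytes = Array.from(bytes.slice(45, 49));
```

In a browser, decodeBase64Text(encoded, 'big5') should then give the XML text that can be passed to DOMParser.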

4 Comments

Thanks for that. TextEncoder/TextDecoder is indeed what I used later. atob is problematic, and so is DOMParser. In a bug report at bugzilla.mozilla.org/show_bug.cgi?id=1287071, a Mozilla developer confirmed that DOMParser assumes all input to be UTF-8. In fact, from dom/base/DOMParser.cpp in mozilla-central, it's easy to see that parseFromString uses a hard-coded UTF-8 encoding. The TextDecoder approach requires knowing the encoding a priori. It's less than ideal but sufficient for my project.
Just for reference, I think it converts from UTF-16 to UTF-8 internally: github.com/mozilla/gecko-dev/blob/master/dom/base/… . Not sure that makes a difference to your situation, admittedly.
Thanks for that. It seems all JavaScript strings are assumed to be UTF-16 at the C level?
I believe so. (Although slightly strange to say "assumed"... they are UTF-16).
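
On the point about needing to know the encoding a priori: for XML, the declaration itself names the encoding, and for ASCII-compatible encodings such as Big5 it can be read from the raw bytes before decoding. The sketch below is one possible approach, not a full implementation of XML encoding detection. The names sniffXmlEncoding and decodeXmlBytes are hypothetical helpers; the sketch ignores byte order marks and UTF-16/UTF-32 input, and it falls back to UTF-8 when no declaration is found.

```javascript
// Hypothetical helper: read the encoding label from an XML declaration.
// Assumes an ASCII-compatible encoding; BOMs and UTF-16/32 are not handled.
function sniffXmlEncoding(bytes, fallback = 'utf-8') {
  // The declaration is plain ASCII, so mapping bytes to char codes is enough.
  const head = String.fromCharCode(...bytes.subarray(0, 200));
  const match = /^<\?xml[^>]*?\sencoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']/.exec(head);
  return match ? match[1] : fallback;
}

// Hypothetical helper: decode XML bytes using the label the declaration announces.
// TextDecoder throws a RangeError if the label is not a supported encoding.
function decodeXmlBytes(bytes) {
  return new TextDecoder(sniffXmlEncoding(bytes)).decode(bytes);
}
```

The decoded string can then be handed to DOMParser as in the answer above.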

Related: parsing a document from non-UTF-8 HTML

/**
* Parse an HTML document from an HTTP response,
* including responses that are not encoded as UTF-8.
*
* Use this instead of
* ```
* const html = await response.text()
* const doc = new DOMParser().parseFromString(html, "text/html");
* ```
* because response.text() always decodes the body as UTF-8.
*
* @param {Response} response
* @return {Promise<Document>}
*/
async function documentOfResponse(response) {
  // example content-type: text/html; charset=ISO-8859-1
  const contentType = response.headers.get("content-type") || "" // the header may be absent
  const type = contentType.split(";")[0].trim() || "text/html"
  // stop at the next ";" so trailing parameters are not captured; allow optional quotes
  const charset = (contentType.match(/;\s*charset\s*=\s*"?([^";\s]+)"?/i) || [])[1]
  let html = ""
  if (charset && !/^utf-?8$/i.test(charset)) { // matches utf-8, utf8, UTF-8, UTF8
    const decoder = new TextDecoder(charset) // throws RangeError for unknown labels
    const buffer = await response.arrayBuffer()
    html = decoder.decode(buffer) // decode the raw bytes into a JavaScript string
  }
  else {
    html = await response.text()
  }
  // DOMParser throws a TypeError for MIME types it does not support
  return new DOMParser().parseFromString(html, type)
}

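// demo (top-level await needs a module; the URL must be same-origin or allow CORS)
const response = await fetch("https://github.com/")
const doc = await documentOfResponse(response)
const title = doc.querySelector("title")
console.log(title ? title.textContent : null)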
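
The charset label in a content-type header can be spelled several ways (utf-8, utf8, UTF8, ...). One way to avoid listing them by hand is to let TextDecoder normalize the label, since its encoding property reports the canonical lowercase name. This is a small sketch under that assumption; charsetOf and isUtf8Label are hypothetical helper names.

```javascript
// Hypothetical helper: extract the charset parameter from a Content-Type value.
// Returns null when the header is missing or has no charset parameter.
function charsetOf(contentType) {
  const match = /;\s*charset\s*=\s*"?([^";\s]+)"?/i.exec(contentType || '');
  return match ? match[1] : null;
}

// Hypothetical helper: true if the label is any recognized alias of UTF-8.
// TextDecoder throws a RangeError for labels it does not recognize.
function isUtf8Label(label) {
  try {
    return new TextDecoder(label).encoding === 'utf-8';
  } catch (err) {
    return false;
  }
}
```

With these, the branch in documentOfResponse could test `charset && !isUtf8Label(charset)` instead of comparing against a single spelling.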
