I am developing a Node.js app that receives an XHTML snippet (Confluence storage format), makes some modifications to it, and then sends back the modified XHTML. The XHTML may contain HTML entities (such as &ouml;) and also CDATA sections (such as <![CDATA[test]]>).

The problem is that, with the parsers I have tried, parsing the snippet in HTML mode breaks the CDATA sections, while parsing it in XML mode fails to interpret the HTML entities.

Below is an example of how I got this to work in the browser, and of how I failed to get it to work using jsdom and cheerio. Is there another library that I could use to achieve this, or a different way to use jsdom or cheerio?

In the browser

In the browser, I can work with DOMParser in XML mode. Working with the test snippet <span>&ouml;<![CDATA[ä]]></span>, I can wrap it in an XHTML body:

const doc = new DOMParser().parseFromString(`<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html><body><span>&ouml;<![CDATA[ä]]></span></body></html>`, 'application/xml');
doc.querySelector('body').innerHTML;   // <span>ö<![CDATA[ä]]></span>
doc.querySelector('body').textContent; // öä

The XML MIME type ensures that the CDATA section is interpreted correctly, while the XHTML DOCTYPE makes sure that the entities are supported.
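
The wrapping step can be pulled into a small helper so that every snippet goes through the same DOCTYPE. This is only a sketch for the browser: wrapSnippet and parseSnippet are hypothetical names that do not belong to any library, and the parsererror check assumes the usual browser behaviour of returning an error document instead of throwing.

```javascript
// The DOCTYPE is the one from the example above; it is what makes the
// HTML entities known to the XML parser in the browser.
const XHTML_DOCTYPE =
  '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" ' +
  '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">';

// Hypothetical helper: wraps a snippet in a minimal XHTML document.
function wrapSnippet(snippet) {
  return `${XHTML_DOCTYPE}<html><body>${snippet}</body></html>`;
}

// Hypothetical helper: parses the wrapped snippet in XML mode and returns
// the <body> element. Only works where DOMParser exists (the browser).
function parseSnippet(snippet) {
  const doc = new DOMParser().parseFromString(wrapSnippet(snippet), 'application/xml');
  // Browsers report XML parse errors through a <parsererror> element.
  if (doc.getElementsByTagName('parsererror').length > 0) {
    throw new Error('The snippet is not well-formed XHTML');
  }
  return doc.querySelector('body');
}

// Usage in the browser:
// const body = parseSnippet('<span>&ouml;<![CDATA[ä]]></span>');
// body.textContent;
```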

jsdom

To achieve the same in Node.js, I attempted to use jsdom. The problem is that when I parse the code in HTML mode, the CDATA section gets converted into a comment, but when I parse it in XML mode, an exception is thrown because of the HTML entity:

import { JSDOM } from 'jsdom';
const xhtml = `<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html><body><span>&ouml;<![CDATA[ä]]></span></body></html>`;

new JSDOM(xhtml).window.document.body.innerHTML; // <span>ö<!--[CDATA[ä]]--></span>
new JSDOM(xhtml).window.document.body.textContent; // ö
new JSDOM(xhtml, { contentType: 'application/xml' }); // Uncaught DOMException [SyntaxError]: about:blank:1:186: undefined entity.

Update: I have reported the problem to jsdom.

cheerio

My preferred method to do DOM modifications in the backend would be cheerio. Using cheerio in HTML mode, the CDATA section gets converted into a comment. In XML mode, the entity is not interpreted but rather double-escaped into &amp;ouml;. In XML mode without decoding entities, the XHTML is preserved correctly, but the entities are not interpreted correctly, which can be seen when getting the text content.

import * as cheerio from 'cheerio'; // namespace import; the default export was removed in cheerio 1.0
const xhtml = `<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html><body><span>&ouml;<![CDATA[ä]]></span></body></html>`;

cheerio.load(xhtml).root().find('body').html(); // <span>ö<!--[CDATA[ä]]--></span>
cheerio.load(xhtml).root().find('body').text(); // ö
cheerio.load(xhtml, { xmlMode: true }).root().find('body').html(); // <span>&amp;ouml;<![CDATA[ä]]></span>
cheerio.load(xhtml, { xmlMode: true }).root().find('body').text(); // &ouml;ä
cheerio.load(xhtml, { xmlMode: true, decodeEntities: false }).root().find('body').html(); // <span>&ouml;<![CDATA[ä]]></span>
cheerio.load(xhtml, { xmlMode: true, decodeEntities: false }).root().find('body').text(); // &ouml;ä

Update: I have reported the problem to cheerio.

Comments

  • Did you try xmldom? It should provide the DOMParser interface and functionality for Node.js. Note that you should set the mimeType to application/xhtml+xml. (Commented Oct 26, 2021 at 23:01)
  • @bigless It also shows an "entity not found" error. (Commented Oct 28, 2021 at 7:55)

1 Answer

I was pointed to a workaround for this issue in cheerio:

cheerio.load(xhtml, { xml: { xmlMode: false, recognizeCDATA: true, recognizeSelfClosing: true } });

With these options, I can successfully parse XHTML in a Node.js environment.

In addition to this solution, I noticed that using DOMParser in the browser has the disadvantage that browsers behave inconsistently. In particular, when using query selectors in combination with XML namespaces, I sometimes had to include the namespace in the query and sometimes not. Because of these inconsistencies, jQuery also officially doesn't support XML namespaces. To achieve consistent behaviour between browsers, and also between the frontend, frontend tests and backend, I decided to use cheerio even for parsing XHTML in the browser.
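
For parsers that only offer strict XML parsing, such as the jsdom call in the question, a different approach might be to rewrite named HTML entities as numeric character references before parsing, since numeric references are valid in plain XML. The following is a minimal sketch under several assumptions: namedToNumericEntities is a hypothetical helper, the entity table is only a small illustrative subset, and it has not been verified against jsdom or xmldom here.

```javascript
// Illustrative subset of HTML named entities mapped to code points.
// Assumption: a real implementation would need the complete HTML entity table.
const NAMED_ENTITIES = new Map([
  ['nbsp', 0xa0],
  ['auml', 0xe4],
  ['ouml', 0xf6],
  ['uuml', 0xfc],
  ['szlig', 0xdf],
]);

// The five entities predefined by XML can be left for the XML parser.
const XML_PREDEFINED = new Set(['amp', 'lt', 'gt', 'quot', 'apos']);

// Hypothetical helper: replaces known named entities with numeric
// character references, leaving CDATA sections untouched.
function namedToNumericEntities(xhtml) {
  return xhtml
    // The capturing group keeps the CDATA sections in the result,
    // always at the odd indices of the array.
    .split(/(<!\[CDATA\[[\s\S]*?\]\]>)/)
    .map((part, index) => {
      if (index % 2 === 1) return part; // CDATA section: keep verbatim
      return part.replace(/&([A-Za-z][A-Za-z0-9]*);/g, (match, name) => {
        // Unknown names are left as they are rather than guessed.
        if (XML_PREDEFINED.has(name) || !NAMED_ENTITIES.has(name)) return match;
        return `&#${NAMED_ENTITIES.get(name)};`;
      });
    })
    .join('');
}

console.log(namedToNumericEntities('<span>&ouml;<![CDATA[ä]]></span>'));
// prints <span>&#246;<![CDATA[ä]]></span>
```

Note that this changes the markup that is sent back: the entity comes back as a numeric reference or a literal character rather than as &ouml;. The sketch also does not skip XML comments, so an entity-like string inside a comment would be rewritten as well.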
