How to parse and modify XHTML in Node.js (supporting HTML entities and CDATA sections)?

Question

I am developing a Node.js app that receives an XHTML snippet (Confluence storage format), should make some modifications to it and then send back the modified XHTML. The XHTML may contain HTML entities (such as &ouml;) and also CDATA sections (such as <![CDATA[test]]>).

The challenge that I’m running into is that with the parsers that I have tried, when I parse the snippet in HTML mode, the CDATA sections break, but when I parse it in XML mode, the HTML entities are not interpreted correctly.

Below is an example of how I got this to work in the browser, and how I failed to get it to work using jsdom and cheerio. Is there any other library that I could use to achieve this, or any different way to use jsdom or cheerio?

In the browser

In the browser, I can work with DOMParser in XML mode. Working with the test snippet <span>&ouml;<![CDATA[ä]]></span>, I can wrap it in an XHTML body:

const doc = new DOMParser().parseFromString(`<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html><body><span>&ouml;<![CDATA[ä]]></span></body></html>`, 'application/xml');
doc.querySelector('body').innerHTML;   // <span>ö<![CDATA[ä]]></span>
doc.querySelector('body').textContent; // öä

The XML MIME type ensures that the CDATA section is interpreted correctly, while the XHTML DOCTYPE makes sure that the entities are supported.

jsdom

To achieve the same in Node.js, I attempted to use jsdom. The problem is that when I parse the code in HTML mode, the CDATA section gets converted into a comment, but when I parse it in XML mode, an exception is thrown because of the HTML entity:

import { JSDOM } from 'jsdom';
const xhtml = `<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html><body><span>&ouml;<![CDATA[ä]]></span></body></html>`;

new JSDOM(xhtml).window.document.body.innerHTML; // <span>ö<!--[CDATA[ä]]--></span>
new JSDOM(xhtml).window.document.body.textContent; // ö
new JSDOM(xhtml, { contentType: 'application/xml' }); // Uncaught DOMException [SyntaxError]: about:blank:1:186: undefined entity.

Update: I have reported the problem to jsdom.

cheerio

My preferred method for DOM modifications in the backend would be cheerio. Using cheerio in HTML mode, the CDATA section gets converted into a comment. In XML mode, the entity is not interpreted but rather double-escaped into &amp;ouml;. In XML mode without decoding entities, the XHTML is preserved correctly, but the entities are still not interpreted, which can be seen when getting the text content.

import * as cheerio from 'cheerio'; // namespace import; cheerio 1.0 no longer ships a default export
const xhtml = `<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html><body><span>&ouml;<![CDATA[ä]]></span></body></html>`;

cheerio.load(xhtml).root().find('body').html(); // <span>ö<!--[CDATA[ä]]--></span>
cheerio.load(xhtml).root().find('body').text(); // ö
cheerio.load(xhtml, { xmlMode: true }).root().find('body').html(); // <span>&amp;ouml;<![CDATA[ä]]></span>
cheerio.load(xhtml, { xmlMode: true }).root().find('body').text(); // &ouml;ä
cheerio.load(xhtml, { xmlMode: true, decodeEntities: false }).root().find('body').html(); // <span>&ouml;<![CDATA[ä]]></span>
cheerio.load(xhtml, { xmlMode: true, decodeEntities: false }).root().find('body').text(); // &ouml;ä

Update: I have reported the problem to cheerio.

Did you try xmldom? It should provide the DOMParser interface and functionality for Node.js. Note that you should set the MIME type to application/xhtml+xml.
– bigless, commented Oct 26, 2021 at 23:01

cdauth · Accepted Answer · 2021-10-29 12:16:53Z

I was pointed to a workaround for the issue in cheerio:

cheerio.load(xhtml, { xml: { xmlMode: false, recognizeCDATA: true, recognizeSelfClosing: true } });

With these options, I can successfully parse XHTML in a Node.js environment.

In addition to this solution, I noticed that using the DOMParser in the browser has the disadvantage that browsers behave inconsistently. In particular, when using query selectors in combination with XML namespaces, I sometimes had to include the namespace in the query and sometimes not. Because of these inconsistencies, jQuery also officially doesn't support XML namespaces. To achieve consistent behaviour between the browsers and also between the frontend, frontend tests and backend, I decided to use cheerio even for parsing XHTML in the browser.
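Where a strict XML parser is still required (for example jsdom with contentType 'application/xml', which rejects &ouml; as an undefined entity), one possible alternative is to rewrite named HTML entities as numeric character references before parsing. This is a different technique from the cheerio workaround above: XML itself predefines only amp, lt, gt, quot and apos, whereas numeric references are always valid. The sketch below assumes a deliberately tiny, hypothetical entity table (NAMED_ENTITIES) and a hypothetical helper name; a real implementation would need the full list of HTML named entities.

```javascript
// Hypothetical, deliberately tiny entity table (name -> code point).
const NAMED_ENTITIES = new Map([
  ['ouml', 0xf6],
  ['auml', 0xe4],
  ['uuml', 0xfc],
  ['nbsp', 0xa0],
]);

// The five entities that XML predefines; these are left untouched.
const XML_PREDEFINED = new Set(['amp', 'lt', 'gt', 'quot', 'apos']);

// Hypothetical helper: replace known named entities with numeric references,
// copying CDATA sections verbatim because "&ouml;" is literal text inside them.
function namedToNumericEntities(xhtml) {
  return xhtml
    // The capturing group keeps CDATA sections in the result, at odd indices.
    .split(/(<!\[CDATA\[[\s\S]*?\]\]>)/)
    .map((part, index) => {
      if (index % 2 === 1) return part; // CDATA section
      return part.replace(/&([a-zA-Z][a-zA-Z0-9]*);/g, (match, name) => {
        if (XML_PREDEFINED.has(name) || !NAMED_ENTITIES.has(name)) return match;
        return `&#${NAMED_ENTITIES.get(name)};`;
      });
    })
    .join('');
}

console.log(namedToNumericEntities('<span>&ouml;<![CDATA[&ouml;]]>&amp;</span>'));
// <span>&#246;<![CDATA[&ouml;]]>&amp;</span>
```

Entities that are not in the table are passed through unchanged, so an XML parser would still reject them; the sketch only narrows the problem to the entities it knows about.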
