Using XPath in node.js

Question

I am building a small document parser in Node.js. For testing I use a raw HTML file; normally the page is downloaded from the real website when the application runs.

I want to extract the first code example from each section of the Console.WriteLine page that matches my constraint: it has to be written in C#. To do that, I have this sample XPath:

//*[@id='System_Console_WriteLine_System_String_System_Object_System_Object_System_Object_']/parent::div/following-sibling::div/pre[position()>1]/code[contains(@class,'lang-csharp')]

If I test the XPath online, I get the expected results, which are in this Gist.

In my node.js application, I am using xmldom and xpath to try and parse that exact same information out:

var dom = require('xmldom').DOMParser;
var xpath = require('xpath');

var exampleLookup = `//*[@id='System_Console_WriteLine_System_String_System_Object_System_Object_System_Object_']/parent::div/following-sibling::div/pre[position()>1]/code[contains(@class,'lang-csharp')]`;
var doc = new dom().parseFromString(rawHtmlString, 'text/html');
var sampleNodes = xpath.select(exampleLookup, doc);

This does not return anything, however.

What might be going on here?

Daniel Haley · Accepted Answer · 2017-11-24 03:41:55Z

5

This is most likely caused by the default namespace (xmlns="http://www.w3.org/1999/xhtml") in your HTML (XHTML).

Looking at the xpath docs, you should be able to bind the namespace to a prefix using useNamespaces and use the prefix in your xpath (untested)...

var exampleLookup = `//*[@id='System_Console_WriteLine_System_String_System_Object_System_Object_System_Object_']/parent::x:div/following-sibling::x:div/x:pre[position()>1]/x:code[contains(@class,'lang-csharp')]`;
var doc = new dom().parseFromString(rawHtmlString, 'text/html');
// useNamespaces returns a select function that knows the prefix bindings.
var select = xpath.useNamespaces({"x": "http://www.w3.org/1999/xhtml"});
// Call the returned function, not xpath.select, or the "x" prefix stays unbound.
var sampleNodes = select(exampleLookup, doc);

Instead of binding the namespace to a prefix, you could also use local-name() in your XPath, but I wouldn't recommend it. This is also covered in the docs.

Example...

//*[@id='System_Console_WriteLine_System_String_System_Object_System_Object_System_Object_']/parent::*[local-name()='div']/following-sibling::*[local-name()='div']/*[local-name()='pre'][position()>1]/*[local-name()='code'][contains(@class,'lang-csharp')]
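
If you do go the local-name() route, the expression gets long enough that typos creep in easily. Here is a minimal sketch of assembling it from parts. `localStep` is a hypothetical helper made up for illustration (it is not part of the xpath package), and the result is the same string as the expression above:

```javascript
// Hypothetical helper: builds a step that matches an element by local name,
// regardless of which namespace the element is in.
function localStep(name) {
  return `*[local-name()='${name}']`;
}

const id = "System_Console_WriteLine_System_String_System_Object_System_Object_System_Object_";

// Each array entry is one location step; they are joined with "/".
const exampleLookup = [
  `//*[@id='${id}']`,
  `parent::${localStep("div")}`,
  `following-sibling::${localStep("div")}`,
  `${localStep("pre")}[position()>1]`,
  `${localStep("code")}[contains(@class,'lang-csharp')]`,
].join("/");

console.log(exampleLookup);
```

The resulting string can be passed to xpath.select as-is, since it contains no prefixes that need binding.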

2 Comments

Design by Adrian Over a year ago

"This is most likely caused by the default namespace". What should it be instead? What is "x" in your example?

Daniel Haley Over a year ago

@DesignbyAdrian - An xml namespace that's declared without a prefix to bind to (like xmlns="http://www.w3.org/1999/xhtml") is said to be a default namespace. It's not that it should be something else instead. The "x" is the arbitrary prefix that I used to bind the namespace uri (http://www.w3.org/1999/xhtml) to. It could've been xhtml, foo, or any other valid prefix name (NCName; see here).
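
To make the point about the arbitrary prefix concrete, here is a rough conceptual model of how an XPath 1.0 name test is compared against an element. It is only a sketch: `expandNameTest` and `matches` are names invented for illustration, and this is not how the xpath package is actually implemented. The idea is that an element is identified by its namespace URI plus local name, an unprefixed name test has no namespace at all (the document's default namespace is not consulted), and a prefixed name test is resolved through whatever bindings you supply.

```javascript
// Conceptual model only; NOT the xpath package's implementation.
const XHTML = "http://www.w3.org/1999/xhtml";

// An element as XPath sees it: a namespace URI plus a local name.
// This stands in for a <div> inside a document with xmlns="http://www.w3.org/1999/xhtml".
const div = { namespaceURI: XHTML, localName: "div" };

// Resolve a name test such as "div" or "x:div" against the prefix bindings.
// Unprefixed names get no namespace; the default namespace is ignored.
function expandNameTest(nameTest, bindings) {
  const parts = nameTest.split(":");
  if (parts.length === 1) {
    return { namespaceURI: null, localName: parts[0] };
  }
  const [prefix, localName] = parts;
  if (!Object.prototype.hasOwnProperty.call(bindings, prefix)) {
    throw new Error(`Unbound prefix: ${prefix}`);
  }
  return { namespaceURI: bindings[prefix], localName };
}

// A name test matches only if both the namespace URI and the local name agree.
function matches(element, nameTest, bindings = {}) {
  const test = expandNameTest(nameTest, bindings);
  return (
    element.namespaceURI === test.namespaceURI &&
    element.localName === test.localName
  );
}

console.log(matches(div, "div"));                     // false
console.log(matches(div, "x:div", { x: XHTML }));     // true
console.log(matches(div, "foo:div", { foo: XHTML })); // true
```

Only the URI is compared, which is why "x", "foo", or "xhtml" all work equally well as long as they are bound to the same namespace URI.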

Hieu Van · Accepted Answer · 2020-05-05 08:58:20Z

3

The xpath-html library lets you query HTML with XPath with minimal effort and few lines of code.

const fs = require("fs");
const html = fs.readFileSync(`${__dirname}/shopback.html`, "utf8");

const xpath = require("xpath-html");
const node = xpath.fromPageSource(html).findElement("//*[contains(text(), 'with love')]");

console.log(`The matched tag name is "${node.getTagName()}"`);
console.log(`Your full text is "${node.getText()}"`);
