I am building a small document parser in node.js. To test it, I have a raw HTML file, which would normally be downloaded from the real website when the application runs.

I want to extract the first code example from each section of the Console.WriteLine page that matches my constraint: it has to be written in C#. To do that, I have this sample XPath:

//*[@id='System_Console_WriteLine_System_String_System_Object_System_Object_System_Object_']/parent::div/following-sibling::div/pre[position()>1]/code[contains(@class,'lang-csharp')]

If I test the XPath online, I get the expected results, which are in this Gist.

In my node.js application, I am using xmldom and xpath to try and parse that exact same information out:

var dom = require('xmldom').DOMParser; // assumed import, not shown in the original
var xpath = require('xpath');          // assumed import, not shown in the original

var exampleLookup = `//*[@id='System_Console_WriteLine_System_String_System_Object_System_Object_System_Object_']/parent::div/following-sibling::div/pre[position()>1]/code[contains(@class,'lang-csharp')]`;
var doc = new dom().parseFromString(rawHtmlString, 'text/html');
var sampleNodes = xpath.select(exampleLookup, doc);

This does not return anything, however.

What might be going on here?

2 Answers

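This is most likely caused by the default namespace (xmlns="http://www.w3.org/1999/xhtml") in your HTML (XHTML). In XPath 1.0, an unprefixed element name such as div only matches elements that are in no namespace, so it never matches the namespaced XHTML elements.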

Looking at the xpath docs, you should be able to bind the namespace to a prefix with useNamespaces and then use that prefix in your XPath (untested):

var exampleLookup = `//*[@id='System_Console_WriteLine_System_String_System_Object_System_Object_System_Object_']/parent::x:div/following-sibling::x:div/x:pre[position()>1]/x:code[contains(@class,'lang-csharp')]`;
var doc = new dom().parseFromString(rawHtmlString, 'text/html');
// useNamespaces returns a select function that can resolve the "x" prefix
var select = xpath.useNamespaces({"x": "http://www.w3.org/1999/xhtml"});
// Call the returned function; plain xpath.select does not know the prefix
var sampleNodes = select(exampleLookup, doc);

Instead of binding the namespace to a prefix, you could also use local-name() in your XPath, but I wouldn't recommend it. This is also covered in the docs.

Example...

//*[@id='System_Console_WriteLine_System_String_System_Object_System_Object_System_Object_']/parent::*[local-name()='div']/following-sibling::*[local-name()='div']/*[local-name()='pre'][position()>1]/*[local-name()='code'][contains(@class,'lang-csharp')]
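
If you do go the local-name() route, assembling the expression from a small helper can keep the repeated predicate readable. This is only a sketch: byLocalName is a hypothetical helper name, not part of the xpath package, and the result is just a string that you would pass to xpath.select as before.

```javascript
// Hypothetical helper (not part of the xpath package): builds a step that
// matches an element by local name, whatever namespace it is in.
function byLocalName(name) {
  return `*[local-name()='${name}']`;
}

const id = "System_Console_WriteLine_System_String_System_Object_System_Object_System_Object_";

// The same expression as above, assembled one location step at a time.
const exampleLookup = [
  `//*[@id='${id}']`,
  `parent::${byLocalName("div")}`,
  `following-sibling::${byLocalName("div")}`,
  `${byLocalName("pre")}[position()>1]`,
  `${byLocalName("code")}[contains(@class,'lang-csharp')]`,
].join("/");
```

The attribute tests (@id, @class) need no special handling, because unprefixed attributes are not in the default namespace.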

2 Comments

"This is most likely caused by the default namespace". What should it be instead? What is "x" in your example?
@DesignbyAdrian - An XML namespace that's declared without a prefix to bind to (like xmlns="http://www.w3.org/1999/xhtml") is called a default namespace. It's not that it should be something else instead. The "x" is an arbitrary prefix that I bound the namespace URI (http://www.w3.org/1999/xhtml) to. It could have been xhtml, foo, or any other valid prefix name (NCName; see here).
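
To illustrate that the prefix really is arbitrary, here is a small sketch that derives both the XPath and the namespace map from the same prefix. buildQuery, XHTML_NS and TARGET_ID are hypothetical names introduced for this example, not part of xpath or xmldom. The only requirement is that the prefix used in the expression matches the key handed to useNamespaces.

```javascript
const XHTML_NS = "http://www.w3.org/1999/xhtml";
const TARGET_ID = "System_Console_WriteLine_System_String_System_Object_System_Object_System_Object_";

// Hypothetical helper: builds the expression and the matching namespace map
// for whatever prefix the caller picks.
function buildQuery(prefix) {
  const p = `${prefix}:`;
  return {
    expression:
      `//*[@id='${TARGET_ID}']/parent::${p}div/following-sibling::${p}div` +
      `/${p}pre[position()>1]/${p}code[contains(@class,'lang-csharp')]`,
    namespaces: { [prefix]: XHTML_NS },
  };
}

const withX = buildQuery("x");
const withXhtml = buildQuery("xhtml");
// Either pair would be used the same way, for example:
//   xpath.useNamespaces(withX.namespaces)(withX.expression, doc)
```

Both objects describe the same query; only the spelling of the prefix differs, while the namespace URI stays the same.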

There is a library, xpath-html, that lets you query HTML with XPath with minimal effort and only a few lines of code.

const fs = require("fs");
const html = fs.readFileSync(`${__dirname}/shopback.html`, "utf8");

const xpath = require("xpath-html");
const node = xpath.fromPageSource(html).findElement("//*[contains(text(), 'with love')]");

console.log(`The matched tag name is "${node.getTagName()}"`);
console.log(`Your full text is "${node.getText()}"`);
