Parsing HTML/XML with XPath in node.js

Question

I'm trying to write an XPath statement to fetch the contents of each row in a table, but only when the 2nd column of that row is not set to "TBA". The page I am working from is the one requested in the code below. I am new to using XPath.

I've come up with the following statement, which I've managed to test successfully (or appears successful anyway) with an online XPath tester, but have been unable to figure out how to apply it in node.js:

//*[@id="body_column_left"]/div[4]/table/tbody/tr/[not(contains(./td[2], 'TBA'))]

My attempt is below. I've tried variations, but I can't even get it to validate as an XPath statement, and as a result I've been lost in unhelpful stack traces:

var fs = require('fs');
var xpath = require('xpath');
var parse5 = require('parse5');
var xmlser = require('xmlserializer');
var dom = require('xmldom').DOMParser;
var request = require('request');

var getHTML = function (url, callback) {
    request(url, function (error, response, body) {
        if (!error && response.statusCode == 200) {
            return callback(body) // return the HTML
        }
    })
}

getHTML("http://au.cybergamer.com/pc/csgo/ladder/scheduled/", function (html) {
    var parser = new parse5.Parser();
    var document = parser.parse(html.toString());
    var xhtml = xmlser.serializeToString(document);
    var doc = new dom().parseFromString(xhtml);
    var select = xpath.useNamespaces({"x": "http://www.w3.org/1999/xhtml"});    
    var nodes = select("//x:*[@id=\"body_column_left\"]/div[4]/table/tbody/tr/[not(contains(./td[2], 'TBA'))]", doc);
    console.log(nodes);    
});
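
A hedged note on the expression itself: in XPath 1.0 a predicate has to be attached to a step, so `tr/[...]` is a syntax error and `tr[...]` is probably what was intended. In addition, xmlserializer emits XHTML, so with `useNamespaces` every element step likely needs the `x:` prefix, not only the first one. The sketch below only builds the expression string; `buildRowQuery` is a hypothetical helper name, and the result has not been checked against the live page.

```javascript
// Hypothetical helper (not part of the xpath package): builds the row query
// with the same namespace prefix on every element step.
function buildRowQuery(prefix) {
  const p = prefix ? prefix + ':' : '';
  const steps = [
    `${p}*[@id="body_column_left"]`,
    `${p}div[4]`,
    `${p}table`,
    `${p}tbody`,
    // The predicate is attached directly to the tr step (no "/" before "[").
    `${p}tr[not(contains(${p}td[2], 'TBA'))]`,
  ];
  return '//' + steps.join('/');
}

const query = buildRowQuery('x');
console.log(query);
```

The resulting string would be passed as the first argument to `select(...)` in the code above, with `doc` as the second.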

Any help would be appreciated!

Thanks for your response @hassansin, I will take a look into using cheerio.
– anditpainsme, Commented Jul 14, 2015 at 6:20

anditpainsme · Accepted Answer · 2015-07-27 02:40:46Z

2

I ended up solving this issue using cheerio instead of xpath:

See below:

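    var cheerio = require('cheerio');

    var rows = [];    // cleaned text of each table row
    var matches = []; // one match object per row
    // `match` is a constructor defined elsewhere (not shown)

    var $ = cheerio.load(html);
    $('.s_grad br').replaceWith("\n"); // turn <br> into newlines so the row text stays line-separated
    $('.s_grad thead').remove();       // drop the table header
    $('.s_grad tr').each(function(i, elem) {
        rows[i] = $(this).text();
        rows[i] = rows[i].replace(/^\s*[\r\n]/gm, ""); // remove empty lines
        matches.push(new match($(this).find('a').attr('href').substring(7).slice(0, -1))); // create matches
    });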
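
To make the two string manipulations in the loop above easier to follow, here is a dependency-free sketch. The helper names `cleanRowText` and `matchIdFromHref`, and the sample href `/match/12345/`, are assumptions for illustration; the real links on the page may look different.

```javascript
// Hypothetical helpers mirroring the string handling in the loop above.

// Removes lines that are empty or contain only whitespace.
function cleanRowText(text) {
  return text.replace(/^\s*[\r\n]/gm, '');
}

// Drops the first 7 characters and the final character of an href.
// Returns null when the row has no link; the loop above would instead
// throw a TypeError, because .substring would be called on undefined.
function matchIdFromHref(href) {
  if (typeof href !== 'string') return null;
  return href.substring(7).slice(0, -1);
}

console.log(JSON.stringify(cleanRowText('Team A\n\n   \nTeam B\n')));
console.log(matchIdFromHref('/match/12345/')); // assumed href shape
```

Guarding against a missing `href` matters if any row in the table has no link.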


Hieu Van · Answer · 2020-05-05 09:27:47Z

-1

How about using xpath-html? I loved its simplicity.

const xpath = require("xpath-html");

const nodes = xpath
  .fromPageSource(html)
  .findElements("//img[starts-with(@src, 'https://cloud.shopback.com')]");


2 Comments

Amna Arshad Over a year ago

While using this library I'm getting this error: TypeError: this.html.charCodeAt is not a function. What's the cause of this?

Hieu Van Over a year ago

Hi @AmnaArshad, can you create an issue on its GitHub repo?
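
`charCodeAt` is a String method, so the error in the comment above suggests that the value passed as the page source was not a string (for example a Buffer or a response object). That is a guess, not a confirmed diagnosis. A minimal sketch of normalizing the input first, where `toHtmlString` is a hypothetical helper and not part of xpath-html:

```javascript
// Hypothetical helper: coerce common inputs to a string before parsing.
function toHtmlString(input) {
  if (typeof input === 'string') return input;
  if (Buffer.isBuffer(input)) return input.toString('utf8');
  // Anything else (objects, undefined, ...) is rejected early with a clear message.
  throw new TypeError('Expected HTML as a string or Buffer, got ' + typeof input);
}

console.log(toHtmlString(Buffer.from('<p>hi</p>')));
```

The returned string would then be handed to `fromPageSource`.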
