I am looking for an HTML content extractor for Node.js that supports XPath. I have looked at several modules:

jsdom, htmlparser2, xpath, cheerio

cheerio works well for selecting by class, id, or tag, but I cannot find a way to query it with an XPath expression. The xpath module does accept XPath and works on small HTML documents, but on larger pages it fails with errors such as:

entity not found:  @#[line:120,col:9], unclosed xml attribute @#[line:1,col:877]

Note: I am not allowed to modify the HTML in any way.

For example, given this HTML:

<html>
<body>

<div>

    <ul id="fruits">
        <li class="apple">Apple</li>
        <li class="orange">Orange</li>
        <li class="pear">Pear</li>
    </ul>

</div>

</body>


</html>

With this HTML and the XPath //*[@id="fruits"]/li[2], the xpath module returns Orange with no errors. But when I load the HTML of http://www.infotaxi.org/india_taxi/ahmedabad_taxi.htm (a much longer page) and query it with

//*[@id="navlistmeniu"]/li[3]/a/b

I get this error:

entity not found:  @#[line:120,col:9]

With cheerio I can extract data by class, id, or tag, but not by XPath.

How can I evaluate an XPath expression against real-world HTML like this in Node.js?

  • Is there a reason you need to use XPath? The point of cheerio is to use normal CSS selectors: $('#navlistmeniu > li').eq(2).find('a > b'); (eq() is zero-based, so eq(2) corresponds to li[3] in the XPath.) Commented May 18, 2015 at 16:40
  • Hi, that is a good approach, but I only have XPaths available, so I would need to convert each XPath into that form. Is there a way to do this? Specifically, I have the XPath of a child element, for example the one for <li class="orange">Orange</li>, and I need the content of all three items, i.e. the output should be Apple, Orange, Pear. In other words, the output should be built from the parent of the given child. Commented May 20, 2015 at 4:25
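
One way to approach the conversion asked about in the comment above is a small translator from XPath to a CSS selector. The following is only a minimal sketch, not a general XPath engine: the function name xpathToCss is hypothetical, and it assumes the XPaths use nothing beyond the forms seen in this question (//*[@id="..."], plain tag steps, and positional steps such as li[2]). It also assumes ids contain no characters that need CSS escaping.

```javascript
// Hypothetical helper: converts a restricted XPath into a CSS selector.
// Supported steps: //tag, /tag, //*[@id="x"], /tag[n]. Anything else throws.
function xpathToCss(xpath) {
  // One step: axis ("/" or "//"), node test, optional [@id="..."] or [n].
  const step = /(\/{1,2})(\*|[\w-]+)(?:\[@id=(["'])(.+?)\3\]|\[(\d+)\])?/y;
  let css = "";
  let pos = 0;
  while (pos < xpath.length) {
    step.lastIndex = pos;
    const m = step.exec(xpath);
    if (!m) {
      throw new Error(`Unsupported XPath near: ${xpath.slice(pos)}`);
    }
    const [, axis, tag, , id, index] = m;
    // "//" is a descendant step (CSS space); "/" is a child step (CSS ">").
    if (css !== "") {
      css += axis === "//" ? " " : " > ";
    }
    if (id !== undefined) {
      css += (tag === "*" ? "" : tag) + "#" + id;
    } else if (index !== undefined) {
      // XPath tag[n] counts siblings of the same tag; *[n] counts all element siblings.
      css += tag === "*" ? `*:nth-child(${index})` : `${tag}:nth-of-type(${index})`;
    } else {
      css += tag;
    }
    pos = step.lastIndex;
  }
  return css;
}

console.log(xpathToCss('//*[@id="fruits"]/li[2]'));
// prints "#fruits > li:nth-of-type(2)"
```

With cheerio, the resulting selector could then be used to reach the siblings through the parent, along the lines of $(xpathToCss(xp)).parent().children(), which for the fruits example would cover Apple, Orange and Pear. If an XPath engine is available instead, appending /../* to the child's XPath selects the same set of elements.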

1 Answer

The xpath-html package may do what you need. Try it yourself:

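const xpath = require("xpath-html");
// `html` is the page source as a string, e.g. read from a file or an HTTP response.
const node = xpath.fromPageSource(html).findElement("//*[contains(text(), 'with love')]");
console.log(node.getText()); // text of the first matching element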