I am trying to find a Java library like C#'s HtmlAgilityPack that can parse HTML and select elements using XPath.

I have read about many libraries, but none of them is a standalone XPath selector for HTML. All the libraries I have found, such as HtmlUnit, require the HTML to be parsed with their own methods.

If someone can guide me with a simple example of XPath 2.0 or 3.0 combined with HTML parsing, I would appreciate it.

  • Selenium can select elements from HTML using XPath. Commented Apr 9, 2020 at 10:37
  • I am looking for a library that takes an HTML string as input and lets me use XPath selectors. Selenium needs to open a browser. Commented Apr 9, 2020 at 10:40
  • Did you try github.com/code4craft/xsoup? It supports XPath 1.0 and has some other built-in functions. Commented Apr 9, 2020 at 10:51
  • Saxon-HE's s9api seems the way to go then. Commented Apr 9, 2020 at 15:01
  • With HtmlUnit you can use an HTML string as input (see the FAQ) to get the page and then work with XPath. Commented Apr 10, 2020 at 11:11

1 Answer

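Java has built-in XPath support in the javax.xml.xpath package. It is usually used for parsing XML files, but it also works for HTML as long as the markup is well-formed XML (every tag closed, as in the sample below).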

HTML sample:

<html lang="en">
<head>
    <title>Index page</title>
</head>
<body>
<div>
    <br/>
    <h1>Hello <span id="my-demo">User!</span></h1>
    <br/>
    <img src="https://s3.amazonaws.com/acloudguru-opsworkslab/ACG_Austin.JPG" alt="photo"/>
</div>
</body>
</html>

Code snippet:

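import java.io.File;
import java.io.IOException;
import java.util.Optional;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

public class HtmlXpathParser {
    private final DocumentBuilder builder;
    private final XPath path;

    public HtmlXpathParser() throws ParserConfigurationException {
        // Standard JDK XML parser; it only accepts well-formed markup
        DocumentBuilderFactory dbfactory = DocumentBuilderFactory.newInstance();
        builder = dbfactory.newDocumentBuilder();
        // Standard JDK XPath 1.0 evaluator
        XPathFactory xpfactory = XPathFactory.newInstance();
        path = xpfactory.newXPath();
    }

    public Optional<String> parse(String fileName) throws SAXException, IOException, XPathExpressionException {
        File file = new File(fileName);

        Document doc = builder.parse(file);
        // Returns the string value of the first match, or "" if nothing matches
        String result = path.evaluate("//img/@src", doc);

        // Map the "no match" empty string to an empty Optional
        return result.isEmpty() ? Optional.empty() : Optional.of(result);
    }

    public static void main(String[] args) throws ParserConfigurationException, XPathExpressionException, SAXException, IOException {
        HtmlXpathParser parser = new HtmlXpathParser();

        Optional<String> srcResult = parser.parse("src/main/resources/index.html");
        srcResult.ifPresent(System.out::println);
    }
}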

Output:

https://s3.amazonaws.com/acloudguru-opsworkslab/ACG_Austin.JPG
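
Since the question asks about passing an HTML string rather than a file, here is a minimal sketch of the same approach reading from a string. The class name HtmlStringXpathDemo and the helper evaluate are made up for illustration; only the JDK calls are real, and the input is assumed to be well-formed.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class HtmlStringXpathDemo {
    // Hypothetical helper: evaluates an XPath 1.0 expression against
    // well-formed markup held in a string and returns the string value
    // of the first match ("" if nothing matches).
    static String evaluate(String html, String expression) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // Wrap the string in an InputSource so no file is needed
        Document doc = builder.parse(new InputSource(new StringReader(html)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        return xpath.evaluate(expression, doc);
    }

    public static void main(String[] args) throws Exception {
        String html = "<html><body><h1>Hello <span id=\"my-demo\">User!</span></h1></body></html>";
        // prints "User!"
        System.out.println(evaluate(html, "//span[@id='my-demo']/text()"));
    }
}
```

The only change from the file-based version is the InputSource wrapping a StringReader; the XPath evaluation itself is identical.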

The built-in API supports XPath 1.0 only. If you need XPath 2.0, you could use something like xpath2-parser.
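
One caveat, sketched under the assumption that your input may be real-world HTML: the JDK parser is a strict XML parser, so markup with unclosed tags is rejected before any XPath can be evaluated. The class StrictParserDemo and the helper parsesAsXml are hypothetical names used only for this illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class StrictParserDemo {
    // Hypothetical helper: reports whether the JDK XML parser accepts the markup
    static boolean parsesAsXml(String markup) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // DefaultHandler rethrows fatal errors instead of printing them to stderr
        builder.setErrorHandler(new DefaultHandler());
        try {
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            // Thrown when the markup is not well-formed XML
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        // Every tag closed: accepted
        System.out.println(parsesAsXml("<div><br/><p>ok</p></div>"));
        // Unclosed <br> and <p>, as is common in HTML: rejected
        System.out.println(parsesAsXml("<div><br><p>ok</div>"));
    }
}
```

If your HTML falls into the second case, it has to be cleaned into well-formed markup by some other tool before this approach applies.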



1 Comment

Thanks for providing this answer. The first issue I see is that this code doesn't clean bad HTML, and the second is that it doesn't support XPath 3.
