Java HTML XPath selector

Question

I am trying to find a Java library like C#'s HtmlAgilityPack to parse HTML and select elements using XPath.

I have read about many libraries, but none of them is a standalone XPath selector for HTML. All the libraries I have found, such as HtmlUnit, require parsing the HTML with their own methods.

If someone can guide me with a simple example of XPath 2.0 or 3.0 and HTML parsing, I would appreciate it.

I am looking for a library that takes an HTML string as input and lets me use XPath selectors. Selenium needs to open a browser.
– Heopas, Commented Apr 9, 2020 at 10:40
Did you try github.com/code4craft/xsoup? It supports XPath 1.0 and has some other built-in functions.
– E.Wiest, Commented Apr 9, 2020 at 10:51
With HtmlUnit you can use an HTML string as input (see the FAQ) to get the page and then work with XPath.
– RBRi, Commented Apr 10, 2020 at 11:11

catch32 · Accepted Answer · 2022-02-14 11:04:03Z


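Java has built-in support for XPath (the javax.xml.xpath package). It is usually used for XML files, but it also works for HTML, provided the HTML is well-formed XML (every tag closed, attribute values quoted), because the document is read by an XML parser.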

HTML sample:

<html lang="en">
<head>
    <title>Index page</title>
</head>
<body>
<div>
    <br/>
    <h1>Hello <span id="my-demo">User!</span></h1>
    <br/>
    <img src="https://s3.amazonaws.com/acloudguru-opsworkslab/ACG_Austin.JPG" alt="photo"/>
</div>
</body>
</html>

Code snippet:

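import java.io.File;
import java.io.IOException;
import java.util.Optional;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

public class HtmlXpathParser {
    private final DocumentBuilder builder;
    private final XPath path;

    public HtmlXpathParser() throws ParserConfigurationException {
        DocumentBuilderFactory dbfactory = DocumentBuilderFactory.newInstance();
        builder = dbfactory.newDocumentBuilder();
        XPathFactory xpfactory = XPathFactory.newInstance();
        path = xpfactory.newXPath();
    }

    // Parses the file with the JDK's XML parser, so the HTML must be well-formed.
    public Optional<String> parse(String fileName) throws SAXException, IOException, XPathExpressionException {
        File file = new File(fileName);

        Document doc = builder.parse(file);
        // evaluate() returns an empty string when nothing matches
        String result = path.evaluate("//img/@src", doc);

        return result.isEmpty() ? Optional.empty() : Optional.of(result);
    }

    public static void main(String[] args) throws ParserConfigurationException, XPathExpressionException, SAXException, IOException {
        HtmlXpathParser parser = new HtmlXpathParser();

        Optional<String> srcResult = parser.parse("src/main/resources/index.html");
        srcResult.ifPresent(System.out::println);
    }
}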

Output:

https://s3.amazonaws.com/acloudguru-opsworkslab/ACG_Austin.JPG
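
The question asks for an HTML string as input rather than a file. A minimal sketch of that variant follows, using only JDK classes. The class name HtmlStringXpath and its select method are hypothetical names chosen for illustration, and the markup is still assumed to be well-formed.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper: evaluates an XPath 1.0 expression against markup held in a String.
class HtmlStringXpath {

    static String select(String html, String expression)
            throws ParserConfigurationException, SAXException, IOException, XPathExpressionException {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // InputSource + StringReader lets the XML parser read from memory instead of a file.
        Document doc = builder.parse(new InputSource(new StringReader(html)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        // Without an explicit return type, evaluate() yields the string value of the
        // first match, or an empty string when nothing matches.
        return xpath.evaluate(expression, doc);
    }

    public static void main(String[] args) throws Exception {
        String html = "<html><body><h1>Hello <span id=\"my-demo\">User!</span></h1></body></html>";
        System.out.println(select(html, "//span[@id='my-demo']")); // prints: User!
    }
}
```

Building a new DocumentBuilder per call keeps the sketch short; code that runs many queries would probably reuse the builder and XPath instances, as the class above does.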

The built-in javax.xml.xpath implementation supports XPath 1.0 only. If you need XPath 2.0, you could use something like xpath2-parser.
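
To make the version difference concrete: XPath 2.0 adds string functions such as lower-case() that XPath 1.0 does not have, and the usual 1.0 workaround is translate(). The sketch below shows that workaround with the JDK's built-in engine. The class XPath1LowerCase and its method are hypothetical names, and the mapping only covers ASCII letters.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper: lower-cases a node's text using only XPath 1.0 functions.
class XPath1LowerCase {
    private static final String UPPER = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
    private static final String LOWER = "abcdefghijklmnopqrstuvwxyz";

    static String lowerCaseOf(String xml, String nodeExpression) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        // translate(s, from, to) replaces each character of 'from' found in s with the
        // character at the same position in 'to'; it stands in for XPath 2.0's lower-case().
        String expression = "translate(" + nodeExpression + ", '" + UPPER + "', '" + LOWER + "')";
        return xpath.evaluate(expression, doc);
    }

    public static void main(String[] args) throws Exception {
        String xml = "<h1>Hello <span id=\"my-demo\">User!</span></h1>";
        System.out.println(lowerCaseOf(xml, "//span")); // prints: user!
    }
}
```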


edited Feb 14, 2022 at 11:04

answered Apr 10, 2020 at 13:15


1 Comment

Heopas Over a year ago

Thanks for providing this answer. The first issue I see is that this code doesn't clean up bad HTML, and the second is that it doesn't support XPath 3.
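
The first limitation in this comment can be reproduced with a short sketch: the JDK's DocumentBuilder is a strict XML parser, so typical "tag soup" such as an unclosed <br> is rejected with a SAXException instead of being repaired. The class WellFormedCheck and its method are hypothetical names used only for this illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper: reports whether the built-in XML parser accepts the markup.
class WellFormedCheck {

    static boolean isParseable(String html) throws ParserConfigurationException, IOException {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        try {
            builder.parse(new InputSource(new StringReader(html)));
            return true;
        } catch (SAXException e) {
            // Thrown for markup that is not well-formed XML; the parser's default
            // error handler may also log the problem to stderr.
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(isParseable("<div><br/><p>ok</p></div>")); // self-closed <br/>: true
        System.out.println(isParseable("<div><br><p>ok</p></div>"));  // unclosed <br>: false
    }
}
```

For real-world pages, the HTML would therefore have to be cleaned by a separate HTML-aware parser before this XPath approach can be applied.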
