Library to query HTML with XPath in Java?

Question

Can anyone recommend a Java library that lets me run XPath queries against HTML fetched from URLs? I've tried JAXP without success.

Thank you.

See stackoverflow.com/questions/9022140/… - not quite a duplicate, as it asks about specific XPath functionality, but there are better answers there.
– Mark Butler, Commented Jan 7, 2013 at 0:34
@Reonarudo I am in the same situation as you were when you asked this question. There are many possible suggestions/solutions in the answers, but I would like to know which solution (library) you used and whether it worked out the way you wanted.
– Uther Pendragon, Commented Jun 20, 2015 at 19:08
@UtherPendragon I'm sorry, but this was a long time ago and I cannot recall which project this was. Anyway, there should be newer/better libraries available nowadays.
– Leonardo Marques, Commented Jun 23, 2015 at 12:14

Community · Accepted Answer · 2017-05-23 11:54:19Z

8

There are several different approaches to this documented on the Web:

Using HtmlCleaner

HtmlCleaner / Java DOM parser - Using XPath Contains against HTML in Java (This is the way I recommend)
HtmlCleaner itself has a built-in utility supporting XPath. See the Javadocs at http://htmlcleaner.sourceforge.net/doc/org/htmlcleaner/XPather.html or this example: http://thinkandroid.wordpress.com/2010/01/05/using-xpath-and-html-cleaner-to-parse-html-xml/

Using Jericho

Jericho and Jaxen http://sujitpal.blogspot.com/2009/04/xpath-over-html-using-jericho-and-jaxen.html

I have tried a few different variations of these approaches, such as HtmlParser plus the Java DOM parser and JSoup plus Jaxen, but the combination that worked best was HtmlCleaner plus the Java DOM parser. The next best combination was Jericho plus Jaxen.

edited May 23, 2017 at 11:54

CommunityBot


answered Jan 7, 2013 at 0:33

Mark Butler


3 Comments

yanchenko Over a year ago

Note that on Android 4.2.2 HtmlCleaner 2.5 turned out to be 4x slower compared to jSoup 1.7.2.

sibbl Over a year ago

Note that HtmlCleaner only supports XPath 1.0.

Andrew Scott Evans Over a year ago

HTML Cleaner + DOM Serializer + Threading = Really bad memory leak

Artem Barger · Accepted Answer · 2010-07-29 10:02:46Z

6

jsoup, a Java HTML parser. Its selector syntax is very similar to jQuery's.

answered Jul 29, 2010 at 10:02

Artem Barger


5 Comments

Artem Barger Over a year ago

I'm not sure. Its queries are much simpler than XPath-based ones. You can read the documentation; there are a lot of good examples explaining how to run those queries.

brabec Over a year ago

jsoup (at least in version 1.7.3) doesn't support XPath.

phil Over a year ago

jsoup uses a CSS/jQuery-style selector syntax, which is similar to and better than XPath.

Neil McGuigan Over a year ago

CSS selectors are not better than XPath. There are some things that you can select with XPath but not with CSS selectors.

Jonathan Hedley Over a year ago

jsoup now supports xpath, as well as CSS selectors. Since September 2021 in jsoup 1.14.3.

bigbounty · Accepted Answer · 2020-06-30 11:47:34Z

2

Use Xsoup. According to the docs, it's faster than HtmlCleaner. Example:

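    import java.util.List;

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.junit.Assert;
    import org.junit.Test;
    import us.codecraft.xsoup.Xsoup;

    public class XsoupSelectTest {

        @Test
        public void testSelect() {
            String html = "<html><div><a href='https://github.com'>github.com</a></div>" +
                    "<table><tr><td>a</td><td>b</td></tr></table></html>";

            // Parse with jsoup first; Xsoup evaluates XPath against the jsoup Document.
            Document document = Jsoup.parse(html);

            // Select an attribute value.
            String result = Xsoup.compile("//a/@href").evaluate(document).get();
            Assert.assertEquals("https://github.com", result);

            // Select the text of every table cell.
            List<String> list = Xsoup.compile("//tr/td/text()").evaluate(document).list();
            Assert.assertEquals("a", list.get(0));
            Assert.assertEquals("b", list.get(1));
        }
    }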

Link to Xsoup - https://github.com/code4craft/xsoup

answered Jun 30, 2020 at 11:47

bigbounty


Martin Honnen · Accepted Answer · 2010-07-29 10:51:38Z

1

You could use TagSoup together with Saxon. That way you simply replace any XML SAX parser used with TagSoup and the XPath 2.0 or XSLT 2.0 or XQuery 1.0 implementation works as usual.
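The "swap the SAX parser" idea can be sketched roughly as follows. This is only an illustration under stated assumptions: it uses the JDK's default SAX reader and the JDK's built-in XPath 1.0 engine as stand-ins so that it runs without extra jars, and the class and method names (SaxReaderSwapSketch, defaultReader, parseWith, evaluate) are made up for the example. With TagSoup on the classpath you would pass its XMLReader implementation (org.ccil.cowan.tagsoup.Parser) instead of the default reader, and with Saxon you would get XPath 2.0 instead of 1.0.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.sax.SAXSource;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

public class SaxReaderSwapSketch {

    // Stand-in reader: the JDK's own SAX parser, which only accepts well-formed markup.
    // Assumption: an HTML-tolerant XMLReader such as TagSoup's would be returned here instead.
    public static XMLReader defaultReader() throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        return factory.newSAXParser().getXMLReader();
    }

    // Builds a DOM tree from whatever SAX reader is supplied,
    // using an identity transform from a SAXSource into a DOMResult.
    public static Node parseWith(XMLReader reader, String markup) throws Exception {
        SAXSource source = new SAXSource(reader, new InputSource(new StringReader(markup)));
        DOMResult result = new DOMResult();
        TransformerFactory.newInstance().newTransformer().transform(source, result);
        return result.getNode();
    }

    // Evaluates an XPath 1.0 expression and returns its string value.
    public static String evaluate(Node root, String expression) throws Exception {
        return XPathFactory.newInstance().newXPath().evaluate(expression, root);
    }

    public static void main(String[] args) throws Exception {
        String markup = "<html><body><p>one</p><p>two</p></body></html>";
        Node root = parseWith(defaultReader(), markup);
        System.out.println(evaluate(root, "count(//p)")); // prints 2
        System.out.println(evaluate(root, "//p[2]"));     // prints two
    }
}
```

Only defaultReader() would need to change to plug in a different parser; the rest of the pipeline just consumes SAX events. Depending on how the replacement reader is configured, elements may arrive in the XHTML namespace, in which case the expressions would need a namespace prefix.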

answered Jul 29, 2010 at 10:51

Martin Honnen


Tassos Bassoukos · Accepted Answer · 2010-07-29 10:00:29Z

0

I've used JTidy to make HTML into a proper DOM, then used plain XPath to query the DOM.

If you want to do cross-document/cross-URL queries, it is better to use JTidy with XQuery.
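For the "plain XPath over the DOM" step, a minimal sketch using only the JDK's javax.xml.xpath API might look like the following. It assumes the tidy step has already produced well-formed markup, so a well-formed string parsed with the standard DocumentBuilder stands in for JTidy's output here. The class and method names (DomXPathSketch, toDom, selectStrings) are hypothetical and exist only for this example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class DomXPathSketch {

    // Stand-in for the tidy step: assumes the markup is already well-formed.
    public static Document toDom(String wellFormedHtml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        return factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(wellFormedHtml)));
    }

    // Evaluates an XPath 1.0 expression and returns the text content of each matched node.
    public static List<String> selectStrings(Document dom, String expression) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate(expression, dom, XPathConstants.NODESET);
        List<String> values = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            values.add(nodes.item(i).getTextContent());
        }
        return values;
    }

    public static void main(String[] args) throws Exception {
        String html = "<html><body><div><a href='https://github.com'>github.com</a></div>"
                + "<table><tr><td>a</td><td>b</td></tr></table></body></html>";
        Document dom = toDom(html);
        System.out.println(selectStrings(dom, "//a/@href")); // prints [https://github.com]
        System.out.println(selectStrings(dom, "//tr/td"));   // prints [a, b]
    }
}
```

Real-world HTML would fail in toDom as written, which is why a cleaning step such as JTidy comes first; the XPath part stays the same once you have an org.w3c.dom.Document.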

answered Jul 29, 2010 at 10:00

Tassos Bassoukos
