Using XPath Contains against HTML in Java

Question

I'm scraping values from HTML pages using XPath inside a Java program to get to a specific tag, occasionally using regular expressions to clean up the data I receive.

After some research, I landed on HTML Cleaner ( http://htmlcleaner.sourceforge.net/ ) as the most reliable way to parse raw HTML into a good XML format. HTML Cleaner, however, only supports XPath 1.0 and I find myself needing functions like 'contains'. For instance, in this piece of XML:

<div>
  <td id='1234 foo 5678'>Hello</td>
</div>

I would like to be able to get the text 'Hello' with the following XPath:

//div/td[contains(@id, 'foo')]/text()

Is there any way to get this functionality? I have several ideas, but would prefer not to reinvent the wheel if I don't need to:

1. If there is a way to call HTML Cleaner's evaluateXPath and return a TagNode (which I have not found), I can use an XML serializer on the returned TagNode and chain together XPaths to achieve the desired functionality.
2. I could use HTML Cleaner to clean to XML, serialize it back to a string, and use that with another XPath library, but I can't find a good Java XPath evaluator that works on a string.
3. Using TagNode functions like getElementsByAttValue, I could essentially recreate XPath evaluation and add the contains functionality using String.contains.

Short question: Is there any way to use XPath contains on HTML with an existing Java library?

contains is in XPath 1.0: w3.org/TR/xpath/#function-contains
– Wayne, Commented Jan 26, 2012 at 17:12
I should have been more specific: HTML Cleaner uses a subset of XPath 1.0 that does not allow contains.
– Wes Iliff, Commented Jan 26, 2012 at 17:21
My take is that the developers of HTMLCleaner wasted a lot of time writing a completely unnecessary (and non-compliant) XPath implementation. There's no reason to ever use it. See my answer for a complete example.
– Wayne, Commented Jan 26, 2012 at 17:36

Wayne · Accepted Answer · 2012-01-28 00:53:05Z

Regarding this:

I could use HTML Cleaner to clean to XML, serialize it back to a string, and use that with another XPath library, but I can't find a good java XPath evaluator that works on a string.

This is exactly what I would do, except that you don't need to operate on a string (see below).
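For the case raised in the quoted idea, where the cleaned XML is already held in a string, the standard JAXP XPath API can evaluate an expression against it directly through an InputSource. The following is a minimal sketch rather than part of the original answer; the class name XPathOnString and the helper evaluate are hypothetical, and the input is assumed to be well-formed XML (that is, already cleaned).

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

public class XPathOnString {
    // Hypothetical helper: evaluates an XPath 1.0 expression against
    // well-formed XML held in a String and returns the string value
    // of the first matching node.
    static String evaluate(String xml, String expression)
            throws XPathExpressionException {
        XPath xpath = XPathFactory.newInstance().newXPath();
        return xpath.evaluate(expression, new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        // The sample from the question; it is already well-formed XML.
        String xml = "<div><td id='1234 foo 5678'>Hello</td></div>";
        System.out.println(
                evaluate(xml, "//div/td[contains(@id, 'foo')]/text()"));
    }
}
```

This only works when the string parses as XML, which is why the cleaning step is still needed for raw HTML.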

A lot of HTML parsers try to do too much. HTMLCleaner, for example, does not fully implement the XPath 1.0 spec (contains, for instance, is an XPath 1.0 function). The good news is that you don't need it to. All you need from HTMLCleaner is for it to parse the malformed input. Once you've done that, it's better to use the standard XML interfaces to deal with the resulting (now well-formed) document.

First convert the document into a standard org.w3c.dom.Document like this:

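import org.htmlcleaner.*;  // HtmlCleaner, TagNode, CleanerProperties, DomSerializer

// Let HtmlCleaner repair the malformed input (note the unclosed tags).
TagNode tagNode = new HtmlCleaner().clean(
        "<div><table><td id='1234 foo 5678'>Hello</td>");
// Convert the cleaned tree into a standard W3C DOM document.
org.w3c.dom.Document doc = new DomSerializer(
        new CleanerProperties()).createDOM(tagNode);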

And then use the standard JAXP interfaces to query it:

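import javax.xml.xpath.*;  // XPath, XPathFactory, XPathConstants

XPath xpath = XPathFactory.newInstance().newXPath();
// contains() is available because JAXP implements XPath 1.0.
String str = (String) xpath.evaluate("//div//td[contains(@id, 'foo')]/text()",
                       doc, XPathConstants.STRING);
System.out.println(str);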

Output:

Hello
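The question also asked about getting nodes back and chaining XPaths together. With JAXP this can be done by requesting XPathConstants.NODESET and then evaluating a relative expression against each returned node. The sketch below is an illustration under assumptions, not part of the original answer: the class ChainedXPath, the method fooTextPerDiv, and the sample markup are hypothetical, and the document is parsed from a well-formed string with the standard DocumentBuilder so that the example needs no third-party library. A Document produced by the DomSerializer step above should be usable in its place.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class ChainedXPath {
    // Hypothetical helper: selects every div, then runs a second,
    // relative XPath against each div to pull out the matching td text.
    static List<String> fooTextPerDiv(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        XPath xpath = XPathFactory.newInstance().newXPath();

        // First XPath: return nodes rather than a string.
        NodeList divs = (NodeList) xpath.evaluate(
                "//div", doc, XPathConstants.NODESET);

        List<String> texts = new ArrayList<>();
        for (int i = 0; i < divs.getLength(); i++) {
            // Second XPath: evaluated relative to the current div node.
            texts.add(xpath.evaluate(
                    "td[contains(@id, 'foo')]/text()", divs.item(i)));
        }
        return texts;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<root>"
                + "<div><td id='1234 foo 5678'>Hello</td></div>"
                + "<div><td id='bar'>Skip</td><td id='9 foo'>World</td></div>"
                + "</root>";
        System.out.println(fooTextPerDiv(xml));
    }
}
```

A div with no matching td contributes an empty string to the list, since the string value of an empty node set is empty.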

edited Jan 28, 2012 at 0:53

answered Jan 26, 2012 at 17:24


5 Comments

Marc Over a year ago

That actually works. Sadly, no maven repo for HtmlCleaner, but the jar is here: sourceforge.net/projects/htmlcleaner/?source=typ_redirect

Aaron Davis Over a year ago

It (HtmlCleaner) is in maven central. search.maven.org/…

Igor G. Over a year ago

But I would still like to evaluate XPath 2.0, which is not possible with JAXP.

Arya Over a year ago

I'm trying to get this to work with Java 11 and I get DomSerializer cannot be resolved to a type

Ranjan Gupta Over a year ago

I tried the above solution too, but had no luck; I got the same error you did. Did you find a solution?
