
Following is a fragment of an HTML document in which I need to associate the "title" (e.g. FILE_BYTES_WRITTEN) with the text() of the first succeeding <td>.

The following XPath works great in Python lxml:

/td[text()='FILE_BYTES_WRITTEN']/following-sibling::td

The doc fragment:

   <td>HDFS_BYTES_READ</td>
   <td align="right">4,825</td>
   <td align="right">0</td>
   <td align="right">4,825</td>
 </tr>

   <tr>

   <td>FILE_BYTES_WRITTEN</td>
   <td align="right">415,881</td>
   <td align="right">48,133</td>
   <td align="right">464,014</td>
 </tr>

   <tr>

   <td>HDFS_BYTES_WRITTEN</td>
   <td align="right">98,580,205</td>
   <td align="right">2,010</td>
   <td align="right">98,582,215</td>
 </tr>

But when I try to do this in Java I am having less success. I am not sure whether any Java HTML parser supports this. I am presently using HtmlCleaner.

2 Answers

You can look into HtmlUnit, which has a handy getByXPath() function. It is a GUI-less browser. Have a look at its examples.

Another one that I use for parsing, and like the most, is Jsoup, which has a powerful select(query) function (it takes CSS-style selectors) that makes these things easy. Check out its Selector class documentation. You will find everything you need.


As a preamble: I will indeed look at HtmlUnit as suggested by @Sage.

In the meantime: I have come up with the following solution:

a) HtmlCleaner actually has a DomSerializer for converting the cleaned HTML to a W3C DOM Document (effectively XHTML):

public static Document toXhtml(String html) throws ParserConfigurationException {
    HtmlCleaner cleaner = new HtmlCleaner();
    TagNode tagNode = cleaner.clean(html);
    DomSerializer domSerializer = new DomSerializer(new CleanerProperties());
    return domSerializer.createDOM(tagNode);
}

b) Once we have XHTML we have plenty of options, for example evaluating the XPath with Xalan.
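
As a rough sketch of step b), the XPath from the question can be evaluated with the javax.xml.xpath API that ships with the JDK, so no extra library is needed. The class name CounterLookup and the helpers parse and firstValueAfter are made up for illustration. To stay self-contained, the sketch parses an already well-formed copy of the fragment with the JDK's DocumentBuilder instead of calling toXhtml(); I am assuming the Document returned by DomSerializer can be passed in the same way, and that the elements carry no namespace and the title contains no single quote.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class CounterLookup {

    // Parses a well-formed XHTML string into a DOM Document.
    // (Stand-in for toXhtml(); real-world HTML would go through HtmlCleaner first.)
    public static Document parse(String xhtml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
    }

    // Returns the text of the first <td> following the <td> whose text equals
    // the given title, or "" when no such cell exists.
    // Assumption: title contains no single quote, since it is spliced into the expression.
    public static String firstValueAfter(Document doc, String title) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        String expr = "//td[normalize-space(.)='" + title + "']/following-sibling::td[1]";
        // XPathConstants.STRING yields the string value of the first matching node.
        return ((String) xpath.evaluate(expr, doc, XPathConstants.STRING)).trim();
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<table>"
                + "<tr><td>HDFS_BYTES_READ</td><td align=\"right\">4,825</td>"
                + "<td align=\"right\">0</td><td align=\"right\">4,825</td></tr>"
                + "<tr><td>FILE_BYTES_WRITTEN</td><td align=\"right\">415,881</td>"
                + "<td align=\"right\">48,133</td><td align=\"right\">464,014</td></tr>"
                + "</table>";
        // prints 415,881
        System.out.println(firstValueAfter(parse(xhtml), "FILE_BYTES_WRITTEN"));
    }
}
```

The [1] predicate restricts the result to the first following sibling; dropping it and requesting XPathConstants.NODESET instead should return all three value cells of the row.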
