Help with Java Swing HTML parsing

Question

I am parsing a collection of HTML documents with the Java Swing HTML parsing libraries, and I am trying to isolate the text between the <title> tags so that I can use it to identify each document. I am having a hard time doing that, because the handleStartTag method doesn't have access to the text inside the tags.

I am not familiar with those libraries, but can you start grabbing text there and then stop when you handle an end tag?
– Michael Myers ♦, Commented Jun 3, 2010 at 19:33
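The suggestion in the comment above can be sketched with the Swing parser's callback class: set a flag in handleStartTag, collect characters in handleText while the flag is set, and clear it in handleEndTag. This is only a minimal sketch; the class names TitleExtractor and TitleCallback, the extractTitle helper, and the sample markup are hypothetical and not taken from the question.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class TitleExtractor {

    /** Hypothetical callback that keeps text only while the parser is inside <title>. */
    static class TitleCallback extends HTMLEditorKit.ParserCallback {
        private final StringBuilder title = new StringBuilder();
        private boolean inTitle = false;

        @Override
        public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
            if (tag == HTML.Tag.TITLE) {
                inTitle = true;   // start grabbing text
            }
        }

        @Override
        public void handleText(char[] data, int pos) {
            if (inTitle) {
                title.append(data);
            }
        }

        @Override
        public void handleEndTag(HTML.Tag tag, int pos) {
            if (tag == HTML.Tag.TITLE) {
                inTitle = false;  // stop grabbing text
            }
        }

        String getTitle() {
            return title.toString().trim();
        }
    }

    /** Parses the document from the reader and returns the collected title text. */
    static String extractTitle(Reader html) throws IOException {
        TitleCallback callback = new TitleCallback();
        // The third argument tells the parser to ignore any charset declared in the document.
        new ParserDelegator().parse(html, callback, true);
        return callback.getTitle();
    }

    public static void main(String[] args) throws IOException {
        // Sample input (an assumption); note the unclosed <br>, which this parser tolerates.
        String html = "<html><head><title>Doc A</title></head>"
                + "<body><p>one<br>two</p></body></html>";
        System.out.println(extractTitle(new StringReader(html)));
    }
}
```

Because the flag is cleared at the end tag, body text never leaks into the title, and the document does not need to be well-formed XML for this to work.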

Michael · Accepted Answer · 2010-06-03 19:43:19Z


You can use XPath to pull out data from HTML:

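import java.io.StringReader;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.stream.StreamSource;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Node;

String html = //...

// Read the HTML into a DOM by running it through an identity transform
StreamSource source = new StreamSource(new StringReader(html));
DOMResult result = new DOMResult();
Transformer transformer = TransformerFactory.newInstance().newTransformer();
transformer.transform(source, result);
Node root = result.getNode();

// Use XPath to get the title; <title> is a child of <head>, not of <html>.
// Unprefixed names only match elements that are in no XML namespace.
XPath xpath = XPathFactory.newInstance().newXPath();
String title = xpath.evaluate("/html/head/title", root);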

However, the HTML must be well-formed XHTML for this to work, because the input is read by an XML parser. For example, the "<br>" tag is valid in HTML but invalid in XHTML because it is not closed. It must be written as "<br />" to be valid in XHTML.
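To illustrate the well-formedness requirement, here is a small hedged sketch that runs the same XPath approach on two inputs, one with "<br />" and one with "<br>". The class name XPathTitleDemo, the helper titleOrNull, and both sample strings are made up for illustration. The sketch assumes that the transformer reports a parse failure as a TransformerException and that the markup declares no default namespace.

```java
import java.io.StringReader;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.stream.StreamSource;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;

public class XPathTitleDemo {

    /** Returns the title text, or null if the markup could not be read as XML. */
    static String titleOrNull(String markup) throws XPathExpressionException {
        try {
            // Identity transform: parse the markup and copy it into a DOM tree.
            DOMResult result = new DOMResult();
            Transformer identity = TransformerFactory.newInstance().newTransformer();
            identity.transform(new StreamSource(new StringReader(markup)), result);

            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate("/html/head/title", result.getNode());
        } catch (TransformerException e) {
            // Input that is not well-formed (for example an unclosed <br>) ends up here.
            return null;
        }
    }

    public static void main(String[] args) throws XPathExpressionException {
        String closed = "<html><head><title>Doc A</title></head>"
                + "<body><p>one<br />two</p></body></html>";
        String unclosed = "<html><head><title>Doc B</title></head>"
                + "<body><p>one<br>two</p></body></html>";

        System.out.println(titleOrNull(closed));
        System.out.println(titleOrNull(unclosed));
    }
}
```

The default error listener may also print a parser message to standard error for the second input.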

