I am parsing a collection of HTML documents with the Java Swing HTML parsing libraries, and I want to isolate the text between the <title> tags so that I can use it to identify each document. I am having a hard time doing that because the handleStartTag method doesn't have access to the text inside the tags.

  • I am not familiar with those libraries, but can you start grabbing text there and then stop when you handle an end tag? Commented Jun 3, 2010 at 19:33
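The suggestion in the comment above can be sketched with the Swing parser roughly as follows: set a flag in handleStartTag when the title tag opens, collect characters in handleText while the flag is set, and clear the flag in handleEndTag. This is only a minimal sketch; the names TitleExtractor, TitleCallback and extractTitle are made up for illustration and are not part of Swing.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper class; the name is an assumption for this sketch.
final class TitleExtractor {

    // Collects the text reported between the <title> start and end tags.
    static final class TitleCallback extends HTMLEditorKit.ParserCallback {
        private final StringBuilder title = new StringBuilder();
        private boolean inTitle = false;

        @Override
        public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
            if (tag == HTML.Tag.TITLE) {
                inTitle = true;   // start grabbing text
            }
        }

        @Override
        public void handleText(char[] data, int pos) {
            if (inTitle) {
                title.append(data);
            }
        }

        @Override
        public void handleEndTag(HTML.Tag tag, int pos) {
            if (tag == HTML.Tag.TITLE) {
                inTitle = false;  // stop grabbing text
            }
        }

        String getTitle() {
            return title.toString().trim();
        }
    }

    // Parses the document from the reader and returns the collected title text.
    static String extractTitle(Reader reader) throws IOException {
        TitleCallback callback = new TitleCallback();
        // true = ignore any charset declared inside the document
        new ParserDelegator().parse(reader, callback, true);
        return callback.getTitle();
    }

    public static void main(String[] args) throws IOException {
        String html = "<html><head><title>My Page</title></head>"
                + "<body><p>Hi<br>there</body></html>";
        System.out.println(extractTitle(new StringReader(html)));
    }
}
```

Because the callback keeps its own state, handleStartTag never needs to see the text itself; it only marks where collection begins.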

1 Answer

You can use XPath to pull out data from HTML:

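import java.io.StringReader;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.stream.StreamSource;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Node;

String html = //...

//read the HTML into a DOM (an identity transform copies the parsed input into the DOMResult)
StreamSource source = new StreamSource(new StringReader(html));
DOMResult result = new DOMResult();
Transformer transformer = TransformerFactory.newInstance().newTransformer();
transformer.transform(source, result);
Node root = result.getNode();

//use XPath to get the title (<title> normally sits inside <head>, not directly under <html>)
XPath xpath = XPathFactory.newInstance().newXPath();
String title = xpath.evaluate("/html/head/title", root);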

However, the HTML must be well-formed XML (for example, XHTML) for this to work, because the input is read by an XML parser. For example, the "<br>" tag is valid in HTML, but it is not well-formed in XHTML because it is never closed. It must be written as "<br />" to be valid in XHTML.
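To see that restriction in practice, here is a small self-contained sketch that runs the same identity-transform and XPath approach on one well-formed input and on one with an unclosed "<br>". The class XhtmlTitle and the method titleOrNull are hypothetical names, and returning null on a parse failure is just a choice made for this example.

```java
import java.io.StringReader;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.stream.StreamSource;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;

// Hypothetical helper class; the name is an assumption for this sketch.
final class XhtmlTitle {

    // Returns the title text, or null if the markup could not be read as XML.
    static String titleOrNull(String markup) throws XPathExpressionException {
        DOMResult result = new DOMResult();
        try {
            // Identity transform: parses the markup as XML and builds a DOM.
            Transformer identity = TransformerFactory.newInstance().newTransformer();
            identity.transform(new StreamSource(new StringReader(markup)), result);
        } catch (TransformerException e) {
            return null; // the XML parser rejected the input
        }
        XPath xpath = XPathFactory.newInstance().newXPath();
        return xpath.evaluate("/html/head/title", result.getNode());
    }

    public static void main(String[] args) throws XPathExpressionException {
        String closed = "<html><head><title>T</title></head><body><br /></body></html>";
        String unclosed = "<html><head><title>T</title></head><body><br></body></html>";
        System.out.println(titleOrNull(closed));
        System.out.println(titleOrNull(unclosed));
    }
}
```

The transformer may also log the parse error to standard error before the exception is caught. Note as well that a document declaring the XHTML namespace on its root element would need a namespace-aware XPath expression, which this sketch does not attempt.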
