1

Is it possible to parse an HTML document, either from a string or from a file, and build a DOM tree that a developer can walk through some API? If so, which tools can be used?

For example:

DomRoot root = parse("myhtml.html");

for (Tag tag : root) {
    // visit each tag
}

Note: this is an HTML document, not XHTML.

1
  • please include "parsing" as a tag too Commented Sep 16, 2009 at 14:22

5 Answers

4

You can use TagSoup. It is a SAX-compliant parser that can clean up malformed content, such as the HTML found on generic web pages, into well-formed XML.

This is <B>bold, <I>bold italic, </b>italic, </i>normal text

gets correctly rewritten as:

This is <b>bold, <i>bold italic, </i></b><i>italic, </i>normal text.
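
Because TagSoup exposes the SAX interface, consuming its output comes down to writing a content handler. The following is a minimal sketch that runs the JDK's own SAX reader over the already well-formed output above, so it needs no extra dependencies. The class name `TagCollector` and the wrapping `<p>` element are assumptions made for illustration. With TagSoup on the classpath, its `org.ccil.cowan.tagsoup.Parser` (an `XMLReader`) would presumably replace the JDK reader and accept the malformed input directly.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: records the name of every start tag a SAX reader reports.
class TagCollector extends DefaultHandler {
    final List<String> tags = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        tags.add(qName);
    }

    static List<String> collect(String wellFormedXml) {
        try {
            // JDK reader used as a stand-in; it only accepts well-formed input.
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            TagCollector handler = new TagCollector();
            reader.setContentHandler(handler);
            reader.parse(new InputSource(new StringReader(wellFormedXml)));
            return handler.tags;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("could not parse input", e);
        }
    }

    public static void main(String[] args) {
        String cleaned =
            "<p>This is <b>bold, <i>bold italic, </i></b><i>italic, </i>normal text</p>";
        System.out.println(collect(cleaned)); // prints [p, b, i, i]
    }
}
```

The handler sees elements as a stream of events rather than as a tree, so any tree structure has to be built by the handler itself or by a DOM builder fed from the same events.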

1 Comment

TagSoup is very good, especially if you have to parse crappy HTML
2

JTidy should let you do what you want.

Usage is fairly straightforward, and parsing is configurable. For example:

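import java.io.InputStream;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.tidy.Tidy;

InputStream in = ...;
Tidy tidy = new Tidy();
// configure Tidy instance as required
...
...
// passing null as the output stream skips writing the tidied markup
Document doc = tidy.parseDOM(in, null);
Element root = doc.getDocumentElement();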

The JavaDoc is hosted here.
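
Since `parseDOM` hands back a standard `org.w3c.dom.Document`, walking the result only needs the W3C DOM API. Below is a rough sketch of a depth-first walk that records one indented line per element. `DomWalker` and its methods are hypothetical names, not part of JTidy, and `sample()` builds a small document by hand as a stand-in for what `tidy.parseDOM(in, null)` would return.

```java
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical helper, not part of JTidy: walks any org.w3c.dom tree.
class DomWalker {
    // Depth-first walk that records one indented line per element node.
    static void walk(Node node, String indent, List<String> out) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            out.add(indent + node.getNodeName());
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            walk(child, indent + "  ", out);
        }
    }

    static List<String> outline(Document doc) {
        List<String> out = new ArrayList<>();
        walk(doc.getDocumentElement(), "", out);
        return out;
    }

    // Stand-in for the Document that tidy.parseDOM(in, null) would return.
    static Document sample() {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element html = doc.createElement("html");
            Element body = doc.createElement("body");
            Element p = doc.createElement("p");
            p.appendChild(doc.createTextNode("hello"));
            body.appendChild(p);
            html.appendChild(body);
            doc.appendChild(html);
            return doc;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Prints html, body and p, each indented two spaces deeper than its parent.
        outline(sample()).forEach(System.out::println);
    }
}
```

In real use, `outline(doc)` would be called on the `doc` from the snippet above instead of on `sample()`.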

1

You can take a look at NekoHTML, a Java library that performs best-effort cleaning and tag balancing on your document. It is an easy way to parse a malformed HTML (or invalid XML) file.

It is distributed under the Apache 2.0 license.

0

HTML Parser seems to support conversion from HTML to XML. Then you can build a DOM tree using the usual Java toolchain.
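
Once the HTML has been converted to well-formed XML, the "usual Java toolchain" could look roughly like the sketch below, which uses the JDK's `DocumentBuilder` to build the DOM and `getElementsByTagName` to query it. The class `XmlToDom`, its method names, and the sample markup are assumptions for illustration; the conversion step itself is not shown.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper: builds a DOM from XML that has already been cleaned up.
class XmlToDom {
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("could not parse input", e);
        }
    }

    // Collects the text content of every element with the given tag name,
    // in document order.
    static List<String> textOf(Document doc, String tagName) {
        List<String> texts = new ArrayList<>();
        NodeList nodes = doc.getElementsByTagName(tagName);
        for (int i = 0; i < nodes.getLength(); i++) {
            texts.add(nodes.item(i).getTextContent());
        }
        return texts;
    }

    public static void main(String[] args) {
        Document doc = parse("<p>This is <b>bold, <i>bold italic, </i></b><i>italic, </i>normal text</p>");
        System.out.println(textOf(doc, "i")); // prints [bold italic, , italic, ]
    }
}
```

This step fails on raw HTML that is not well-formed, which is why the conversion has to come first.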

0

There are several open source tools to parse HTML from Java.

Check http://java-source.net/open-source/html-parsers

You can also check the answers to this question: Reading HTML file to DOM tree using Java. It is almost the same.
