0

(I've seen similar questions, but I think none of them cater to my specific needs, hence...)

I would like to know if there is a Java library for analysis of real-world (read: incomplete, ill-formed) HTML. By analysis, I mean things like:

  • figuring out the most prominent color in an HTML chunk
  • changing that color to some other color (hence, has to support modification of the HTML as well)
  • pruning out unwanted tags
  • fixing up the HTML to result in a well formed HTML snippet

Parts of the last two are done by libraries such as Jericho and JTidy. 'Plugins' on top of these would be great.

Thanks in advance!

1
  • Well, after some analysis it seems what I've asked for in the first bullet above is not readily available :( Have to think of some slick algorithm for this... Commented Jan 28, 2010 at 10:38
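
One possible starting point for the first bullet, as a minimal sketch only. It assumes "most prominent" simply means "occurs most often" and looks only at six-digit hex colors (`#rrggbb`) in the raw markup. Named colors, `rgb()` values, three-digit hex, and how much of the rendered page each color actually covers are all ignored. The class and method names are made up for illustration and do not come from any library.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ColorCounter {
    // Six-digit hex colors such as #FF0000; the lookbehind skips numeric
    // character references like &#123456; (assumption: nothing else needs excluding).
    private static final Pattern HEX_COLOR =
            Pattern.compile("(?<!&)#[0-9a-fA-F]{6}\\b");

    // Returns the most frequent hex color in lower case, or null if none is found.
    // Ties go to the color that appears first in the input.
    static String mostProminentColor(String html) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        Matcher m = HEX_COLOR.matcher(html);
        while (m.find()) {
            counts.merge(m.group().toLowerCase(Locale.ROOT), 1, Integer::sum);
        }
        String best = null;
        int bestCount = 0;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }
}
```

Because this works on the raw text, it does not care whether the HTML is well formed, which suits the "real-world HTML" requirement, but it is only a frequency count and not real semantic analysis.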

4 Answers

4

You might want to check out TagSoup:

http://home.ccil.org/~cowan/XML/tagsoup/


1 Comment

None of these libraries offers much semantic analysis, but I voted for this because TagSoup is really impressive nevertheless.
2

I would first tidy it into valid XML, then use XSLT to do a conditional deep copy that handles the most-prominent-color detection, pruning, or whatever other processing you need.
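
A rough sketch of what that conditional deep copy could look like for the pruning part, using only the JDK's built-in XSLT 1.0 support. It assumes the input has already been tidied into well-formed XML with no namespace on the elements (for namespaced XHTML the match patterns would need a prefix). The choice of `script` and `font` as the unwanted tags and the `XsltPruner` class name are just illustrations.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class XsltPruner {
    private static final String STYLESHEET =
          "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        // Identity template: deep-copies everything not matched below.
        + "<xsl:template match='@*|node()'>"
        + "<xsl:copy><xsl:apply-templates select='@*|node()'/></xsl:copy>"
        + "</xsl:template>"
        // Drop <script> together with its content.
        + "<xsl:template match='script'/>"
        // Unwrap <font>: drop the tag but keep its children.
        + "<xsl:template match='font'><xsl:apply-templates select='node()'/></xsl:template>"
        + "</xsl:stylesheet>";

    // Applies the stylesheet to well-formed XML and returns the result as a string.
    static String prune(String xml) {
        try {
            Transformer t = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
            t.setOutputProperty(OutputKeys.METHOD, "xml");
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("XSLT transformation failed", e);
        }
    }
}
```

Calling `XsltPruner.prune("<div><script>x()</script><font color=\"red\"><b>hi</b></font></div>")` should leave the `div` and `b` elements in place while removing the `script` element and the `font` wrapper.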


1

Take a look at JTidy, a Java port of HTML Tidy. It will, depending on what options you choose, fix non-well-formed HTML and otherwise clean it up.

You'll need something else for the colour changing stuff.
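
For the colour changing, one simple option is a plain text substitution, sketched below with the standard library only. It assumes colours are written as hex literals such as `#ff0000` and swaps one literal for another regardless of case. It will not catch the same colour written as a name or an `rgb()` value, and `ColorReplacer` is a hypothetical helper, not part of JTidy.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ColorReplacer {
    // Replaces every case-insensitive occurrence of the hex color 'from' with 'to'.
    // The trailing \b stops a match from ending in the middle of a longer token.
    static String replaceColor(String html, String from, String to) {
        Pattern p = Pattern.compile(Pattern.quote(from) + "\\b", Pattern.CASE_INSENSITIVE);
        return p.matcher(html).replaceAll(Matcher.quoteReplacement(to));
    }
}
```

Since it operates on the raw string, it can be run either before or after the clean-up step.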

1 Comment

Thanks. I'm aware of JTidy. I was looking for something that can do more semantic analysis on an HTML fragment.
0

Maybe you will find something in this list (try TagSoup, NekoHTML, VietSpider HTMLParser).

