3

How can I remove comments, along with their contents, from an HTML file using Java, where the comments are written like:

<!--

Any ideas or help on this would be appreciated.

1
  • This question should be named "How to remove comments from HTML using Java" Commented May 26, 2009 at 11:52

3 Answers

5

Take a look at JTidy, the Java port of HTML Tidy. You could override the print methods of the PPrint object so that comment tags are skipped on output.


4

If your input is not valid XHTML (a comment here reminded me of that case), first run it through JTidy to clean up the HTML and turn it into valid XHTML.

See this for example code on jtidy.

Then I'd parse the tidied XHTML into a DOM instance.

Like so:

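// Needs javax.xml.parsers.*, org.w3c.dom.Document, org.xml.sax.InputSource, java.io.StringReader.
// `string` is assumed to hold the tidied, well-formed XHTML source.
final DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
final DocumentBuilder documentBuilder = factory.newDocumentBuilder();
final Document document = documentBuilder.parse(new InputSource(new StringReader(string)));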

Then I'd walk the document tree and remove the comment nodes (or modify other nodes as needed).
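As a rough illustration of that tree walk, here is a minimal sketch that uses only the JDK's built-in DOM API. The class name `DomCommentRemover` and its helper methods are hypothetical names made up for this example, and it assumes the input is already well-formed XHTML (for instance after a JTidy pass).

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of JTidy or the JDK.
final class DomCommentRemover {

    // Parses well-formed XHTML into a DOM document.
    // Checked exceptions are wrapped so callers do not have to declare them.
    static Document parse(String xhtml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            DocumentBuilder builder = factory.newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalStateException("Input is not well-formed XHTML", e);
        }
    }

    // Recursively removes every comment node among the descendants of the given node.
    static void removeComments(Node node) {
        Node child = node.getFirstChild();
        while (child != null) {
            Node next = child.getNextSibling(); // remember the sibling before detaching the child
            if (child.getNodeType() == Node.COMMENT_NODE) {
                node.removeChild(child);
            } else {
                removeComments(child);
            }
            child = next;
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<html><!-- top --><body><p>text<!-- inner --></p></body></html>");
        removeComments(doc);
        // The <p> element is left with a single child: its text node.
        System.out.println(doc.getElementsByTagName("p").item(0).getChildNodes().getLength());
    }
}
```

If the comments never need to be inspected, calling `setIgnoringComments(true)` on the `DocumentBuilderFactory` before parsing should make the parser drop them up front, which would make the explicit walk unnecessary.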

1 Comment

Most HTML around is still not XHTML, so JTidy should probably be the first option, not an afterthought.

0

Try a simple regex like:

String commentless = pageString.replaceAll("<!--[\\w\\W]*?-->", "");

edit: to explain the regex:

  • <!-- matches the literal comment start
  • [\w\W] matches any single character inside the comment, including newlines (a word character or a non-word character covers everything)
  • *? repeats that 'any character' match zero or more times, but as few times as possible (non-greedy), so the match stops at the first -->
  • --> closes the comment
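To make the explanation above concrete, here is a small self-contained sketch of the same pattern applied to a page with a multi-line comment. The class name `RegexCommentStripper` and the sample input are made up for illustration. Note that inside a Java string literal each backslash has to be doubled.

```java
import java.util.regex.Pattern;

// Hypothetical demo class for the regex explained above.
final class RegexCommentStripper {

    // "<!--", then any characters (including newlines) matched lazily, then "-->".
    private static final Pattern COMMENT = Pattern.compile("<!--[\\w\\W]*?-->");

    // Returns the input with every match of the comment pattern removed.
    static String strip(String html) {
        return COMMENT.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String page = "<p>before</p>\n<!-- first\n     comment -->\n<p>after</p><!-- second -->";
        // Both comments are removed; the multi-line one does not need any special flag.
        System.out.println(strip(page));
    }
}
```

Because the quantifier is non-greedy, the two comments are matched separately and the `<p>after</p>` between them survives. With a greedy `*`, everything from the first `<!--` to the last `-->` would be removed.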

9 Comments

A simple regex should be able to do the job - but this one doesn't ... comments are not always opened and closed on the same line. I just found this link on google that seems better: ostermiller.org/findhtmlcomment.html
If you try this, it works. The \w\W catches everything, including newlines, unlike '.'
Not exactly sure why this is downvoted. Regardless of whether or not this particular RegEx works, RegEx IS the way to go here.
No, it isn't. It would remove "comment" from this too: <input type="text" value="<!-- Hello world -->">, which would be incorrect. <!-- doesn't always start the comment.
Good point. Is it legal to have < in a string? I'm fairly sure > will throw most browsers.
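The objection raised in the comments can be reproduced with a short sketch. The class name `AttributePitfallDemo` is hypothetical, and the input is the `<input>` element quoted in the comment above.

```java
import java.util.regex.Pattern;

// Hypothetical demo class showing where the plain regex approach goes wrong.
final class AttributePitfallDemo {

    // Same comment pattern as in the regex answer.
    private static final Pattern COMMENT = Pattern.compile("<!--[\\w\\W]*?-->");

    static String strip(String html) {
        return COMMENT.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "<input type=\"text\" value=\"<!-- Hello world -->\">";
        // The regex has no notion of attribute values, so the value is emptied.
        System.out.println(strip(html)); // prints: <input type="text" value="">
    }
}
```

The regex cannot tell that the `<!--` sits inside a quoted attribute value, which is the argument for using a real parser when inputs like this are possible.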