Is there a regular expression that can completely remove an HTML tag? By the way, I'm using Java.
Typing your title into the Search box, I got the following: stackoverflow.com/search?q=How+to+remove+HTML+tag+in+Java ... did you not get the same while you were posting the question? – kdgregory, Nov 9, 2009 at 12:37
I found no duplicates. These questions care about extracting text from HTML: stackoverflow.com/questions/240546/… stackoverflow.com/questions/832620/stripping-html-tags-in-java – tangens, Nov 10, 2009 at 17:24
6 Answers
There is jsoup, a Java library made for HTML manipulation. Look at the clean() method and the Whitelist class (named Safelist in recent jsoup versions). It is an easy-to-use solution!
5 Comments
String plaintext = Jsoup.parse(html).text();
Jsoup.parse(html).text() removes all of the tags and collapses whitespace, leaving you with a single long line of text, while new HtmlToPlainText().getPlainText(Jsoup.parse(html)) formats the text in a simplistic way, keeping line breaks, paragraphs, bullet points, etc.
You should use an HTML parser instead. I like htmlCleaner, because it gives me a pretty-printed version of the HTML.
With htmlCleaner you can do:
TagNode root = htmlCleaner.clean( stream );
Object[] found = root.evaluateXPath( "//div[@id='something']" );
if( found.length > 0 && found[0] instanceof TagNode ) {
    ((TagNode)found[0]).removeFromTree();
}
If you just need to remove tags then you can use this regular expression:
content = content.replaceAll("<[^>]+>", "");
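It removes only the tags themselves; other HTML constructs (entities such as &amp;, or the contents of script and style elements) are left in place. For anything more complex you should use a parser.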
EDIT: To avoid problems with HTML comments you can do the following:
content = content.replaceAll("<!--.*?-->", "").replaceAll("<[^>]+>", "");
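As a rough illustration of the two expressions in this answer, here is a minimal self-contained sketch. The class name TagStripper and its method names are made up for this example. The (?s) flag in the comment pattern is an addition that is not in the answer above: by default "." does not match line breaks in Java, so I am assuming you also want comments that span several lines to be removed as a whole.

```java
public class TagStripper {

    // Removes anything that looks like a tag: '<', one or more non-'>' characters, '>'.
    static String stripTags(String content) {
        return content.replaceAll("<[^>]+>", "");
    }

    // Removes HTML comments first, then the remaining tags.
    // (?s) is an addition to the answer's pattern (an assumption, see the text above):
    // it lets '.' match line breaks, so multi-line comments are removed as a whole.
    static String stripCommentsAndTags(String content) {
        return content.replaceAll("(?s)<!--.*?-->", "").replaceAll("<[^>]+>", "");
    }

    public static void main(String[] args) {
        String html = "<p>Fish &amp; <b>chips</b></p><!-- a > b\n still a comment -->";
        // The tag-only pattern stops at the first '>' inside the comment,
        // so part of the comment text is left behind.
        System.out.println(stripTags(html));
        // Stripping comments first avoids that leftover.
        System.out.println(stripCommentsAndTags(html));
    }
}
```

Note that the entity &amp; survives in both cases, since neither pattern decodes entities.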
2 Comments
., ^ and $, the s- and m flags can be omitted.
You can use this simple code to remove all HTML tags:
htmlString = htmlString.replaceAll("\\<.*?\\>", "");
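One thing to watch with the lazy-dot pattern: in Java "." does not match line breaks unless the DOTALL flag is set, so a tag that is split across lines is not matched. The sketch below shows the difference. The class name LazyDotDemo and its method names are hypothetical, and the (?s) variant is my own addition rather than part of the answer.

```java
public class LazyDotDemo {

    // The pattern from the answer above: '<', as few characters as possible, '>'.
    static String stripWithLazyDot(String html) {
        return html.replaceAll("\\<.*?\\>", "");
    }

    // Same pattern with the DOTALL flag (?s), so '.' also matches line breaks.
    // This variant is an assumption added for illustration.
    static String stripWithDotAll(String html) {
        return html.replaceAll("(?s)<.*?>", "");
    }

    public static void main(String[] args) {
        // An opening tag whose attribute sits on the next line.
        String html = "<a\nhref=\"x\">link</a>";
        // Only the closing </a> is removed; the multi-line opening tag stays.
        System.out.println(stripWithLazyDot(html));
        // Both tags are removed, leaving just the link text.
        System.out.println(stripWithDotAll(html));
    }
}
```

The character-class form <[^>]+> from the other answer does not have this problem, because [^>] matches line breaks as well.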