15

Is there a regular expression that can completely remove an HTML tag? By the way, I'm using Java.

2

6 Answers

24

There is jsoup, a Java library made for HTML manipulation. Look at the clean() method and the Whitelist object (renamed Safelist in recent jsoup releases). It's an easy-to-use solution!


5 Comments

Wow, you really made my day. Markdownj, Markdown4J, htmlCleaner: all of them are ***** (sorry). JSoup is the only one where you really achieve this with a one-liner: String plain = new HtmlToPlainText().getPlainText(Jsoup.parse(html));
Shorter code would be String plaintext = Jsoup.parse(html).text();
@jrarama - Not at all. Jsoup.parse(html).text() removes all of the tags and whitespace, leaving you with a single long line of text, while new HtmlToPlainText().getPlainText(Jsoup.parse(html)) formats the text in a simplistic way, keeping line breaks, paragraphs, bullet points, etc.
@isapir: HtmlToPlainText is not included in mvnrepository.com/artifact/org.jsoup/jsoup/1.11.3
That's because HtmlToPlainText is an example class, see github.com/jhy/jsoup/blob/master/src/main/java/org/jsoup/…
20

You should use an HTML parser instead. I like htmlCleaner, because it gives me a pretty-printed version of the HTML.

With htmlCleaner you can do:

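// Requires the HtmlCleaner library (package org.htmlcleaner).
HtmlCleaner htmlCleaner = new HtmlCleaner();
TagNode root = htmlCleaner.clean( stream );
// Attribute tests in XPath need the @ prefix.
Object[] found = root.evaluateXPath( "//div[@id='something']" );
if( found.length > 0 && found[0] instanceof TagNode ) {
    ((TagNode)found[0]).removeFromTree();
}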

1 Comment

Do we need any library in order to use the code above? And in root.evaluateXPath( "//div[id='something']" ); the "something" could be any id, right? Please let me know. Thanks.
6

If you just need to remove tags then you can use this regular expression:

content = content.replaceAll("<[^>]+>", "");

It will remove only the tags, not other HTML constructs such as entities or the text inside elements. For anything more complex you should use a parser.
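To make the "only tags" limitation concrete, here is a minimal sketch using nothing but java.lang.String. The class name TagStripDemo and the sample strings are made up for illustration; the regex is the one from the answer.

```java
// Hypothetical demo class; only the regex itself comes from the answer.
final class TagStripDemo {

    // Deletes anything of the form '<', one or more non-'>' characters, '>'.
    static String stripTags(String html) {
        return html.replaceAll("<[^>]+>", "");
    }

    public static void main(String[] args) {
        // Tags disappear, the text between them stays.
        System.out.println(stripTags("<p>Hello <b>world</b></p>")); // Hello world

        // Entities are left untouched: the regex deletes tags and nothing else.
        System.out.println(stripTags("<p>Tom &amp; Jerry</p>"));    // Tom &amp; Jerry
    }
}
```

If the goal is readable plain text, the leftover entities still have to be decoded separately, which is one reason a parser is usually the safer route.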

EDIT: To avoid problems with HTML comments, strip them first. The (?s) flag lets . match line breaks, so comments that span several lines are removed too:

content = content.replaceAll("(?s)<!--.*?-->", "").replaceAll("<[^>]+>", "");
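The following sketch shows why the order matters when a comment contains a '>' character. The class and method names are hypothetical, and the comment pattern here carries the (?s) flag as an assumption so that it also covers multi-line comments.

```java
// Hypothetical demo class comparing the two approaches on the same input.
final class CommentStripDemo {

    // Tag regex alone: a '>' inside a comment ends the match too early.
    static String stripTagsOnly(String html) {
        return html.replaceAll("<[^>]+>", "");
    }

    // Comments are removed first, then whatever tags remain.
    static String stripCommentsThenTags(String html) {
        return html.replaceAll("(?s)<!--.*?-->", "").replaceAll("<[^>]+>", "");
    }

    public static void main(String[] args) {
        String html = "a<!--x>y-->b";
        System.out.println(stripTagsOnly(html));         // ay-->b
        System.out.println(stripCommentsThenTags(html)); // ab
    }
}
```

With the tag regex alone, part of the comment leaks into the output; removing comments first avoids that for this input.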

2 Comments

Since you do not use any of the metacharacters ., ^ and $, the s and m flags can be omitted.
This regex is liable to cause mangling if the HTML contains XML comments with embedded '<' or '>' characters.
4

No. Regular expressions cannot, by definition, parse HTML, because HTML's arbitrarily nested structure is not a regular language.

You could use a regex to s/<[^>]*\>// or something naive like that but it's going to be insufficient, especially if you're interested in removing the contents of tags.
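A small sketch of where the naive substitution falls short, written in Java since that is what the question uses. The class name NaiveStripDemo and the inputs are hypothetical; the pattern is the Java equivalent of the s/<[^>]*\>// expression above.

```java
// Hypothetical demo class showing two failure modes of the naive pattern.
final class NaiveStripDemo {

    static String naiveStrip(String html) {
        return html.replaceAll("<[^>]*>", "");
    }

    public static void main(String[] args) {
        // The script tags go away, but the script body stays in the text.
        System.out.println(naiveStrip("<script>alert(1)</script>Hi")); // alert(1)Hi

        // A '>' inside a quoted attribute value ends the match too early.
        System.out.println(naiveStrip("<a title=\"a>b\">link</a>"));   // b">link
    }
}
```

Both cases are handled correctly by a real HTML parser, which knows about element contents and attribute quoting.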

As another poster said, use an actual HTML parser.


2

You don't need any HTML parser. The code below removes all HTML comments:

htmlString = htmlString.replaceAll("(?s)<!--.*?-->", "");
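The (?s) prefix is what makes this work on comments that span several lines. A minimal sketch of the difference, with a hypothetical class name and sample input:

```java
// Hypothetical demo class contrasting the pattern with and without (?s).
final class DotAllDemo {

    // Without (?s), '.' does not match line terminators.
    static String withoutFlag(String html) {
        return html.replaceAll("<!--.*?-->", "");
    }

    // With (?s) (DOTALL), '.' matches line terminators as well.
    static String withDotAll(String html) {
        return html.replaceAll("(?s)<!--.*?-->", "");
    }

    public static void main(String[] args) {
        String html = "x<!-- line one\nline two -->y";
        System.out.println(withoutFlag(html).equals(html)); // true: nothing was removed
        System.out.println(withDotAll(html));               // xy
    }
}
```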


0

You can use this simple code to remove all HTML tags:

htmlString = htmlString.replaceAll("\\<.*?\\>", "");
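The ? after .* is doing the important work in this pattern. As a rough illustration (class name and input are hypothetical), compare the reluctant form from the answer with a greedy variant:

```java
// Hypothetical demo class comparing reluctant and greedy matching.
final class ReluctantDemo {

    // Reluctant: each match stops at the first '>' it can reach.
    static String reluctant(String html) {
        return html.replaceAll("\\<.*?\\>", "");
    }

    // Greedy: a single match runs from the first '<' to the last '>'.
    static String greedy(String html) {
        return html.replaceAll("\\<.*\\>", "");
    }

    public static void main(String[] args) {
        String html = "<b>bold</b> and <i>italic</i>";
        System.out.println(reluctant(html));        // bold and italic
        System.out.println(greedy(html).isEmpty()); // true: the whole line was swallowed
    }
}
```

Like the other regex answers, this still assumes that tags stay on one line and contain no '>' inside attribute values.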

3 Comments

This will only remove opening tags and leave closing tags unhandled.
I would never do a job like that on my own. Parsing HTML into plain text is a really tough job.
It worked for me, but it probably depends on the complexity of the tags, comments, scripts, etc. So, for a complex case, an HTML library may be the better choice.
