How to remove HTML tag in Java [duplicate]

Question

Is there a regular expression that can completely remove an HTML tag? By the way, I'm using Java.

Typing your title into the Search box, I got the following: stackoverflow.com/search?q=How+to+remove+HTML+tag+in+Java ... did you not get the same while you were posting the question?
– kdgregory, Commented Nov 9, 2009 at 12:37
I found no duplicates. These questions are about extracting text from HTML: stackoverflow.com/questions/240546/… stackoverflow.com/questions/832620/stripping-html-tags-in-java
– tangens, Commented Nov 10, 2009 at 17:24

Alex · Accepted Answer · 2012-01-27 17:26:01Z

24

There is jsoup, a Java library made for HTML manipulation. Look at the Jsoup.clean() method and the Whitelist class (renamed Safelist in newer jsoup releases). It is an easy-to-use solution!

edited Jan 27, 2012 at 17:26

Alex

answered Jan 27, 2012 at 16:40

Simon

5 Comments

jebbie Over a year ago

WOW, you sir, really made my day, i like that, YES! Markdownj, Markdown4J, htmlCleaner.. all of them is ***** sorry.. JSoup is the one and only where you really achieve that with a one-liner: String plain = new HtmlToPlainText().getPlainText(Jsoup.parse(html));

jrarama Over a year ago

A shorter version would be String plaintext = Jsoup.parse(html).text();

isapir Over a year ago

@jrarama - Not at all. Jsoup.parse(html).text() removes all of the tags and whitespace, leaving you with a single long line of text, while new HtmlToPlainText().getPlainText(Jsoup.parse(html)) formats the text in a simplistic way, keeping line breaks, paragraphs, bullet points, etc.

Marco Sulla Over a year ago

@isapir: HtmlToPlainText is not included in mvnrepository.com/artifact/org.jsoup/jsoup/1.11.3

ChrLipp Over a year ago

That's because HtmlToPlainText is an example, see github.com/jhy/jsoup/blob/master/src/main/java/org/jsoup/…

tangens · Accepted Answer · 2009-11-09 06:05:36Z

20

You should use an HTML parser instead. I like HtmlCleaner, because it gives me a pretty-printed version of the HTML.

With HtmlCleaner you can do:

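// Needs the HtmlCleaner library (package org.htmlcleaner) on the classpath.
HtmlCleaner htmlCleaner = new HtmlCleaner();
TagNode root = htmlCleaner.clean( stream );  // stream: an InputStream with the HTML
// XPath addresses attributes with '@'; evaluateXPath can throw XPatherException.
Object[] found = root.evaluateXPath( "//div[@id='something']" );
if( found.length > 0 && found[0] instanceof TagNode ) {
    ((TagNode)found[0]).removeFromTree();
}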

answered Nov 9, 2009 at 6:05

tangens

1 Comment

Geet taunk Over a year ago

Do we need any library in order to use the above code? And in root.evaluateXPath( "//div[id='something']" ); the "something" could be any id, right? Please let me know. Thanks.

Andrey Adamovich · Accepted Answer · 2009-11-09 12:40:30Z

6

If you just need to remove tags then you can use this regular expression:

content = content.replaceAll("<[^>]+>", "");

This removes only the tags themselves; everything else (text, script and style bodies, entities such as &amp;) stays in the output. For anything more complex you should use a parser.
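A minimal sketch of what that one-liner does and does not remove. The class name TagRegexDemo and the method stripTags are made up for illustration; only the pattern comes from the answer.

```java
import java.util.regex.Pattern;

class TagRegexDemo {
    // '<', one or more characters that are not '>', then '>'.
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    // Hypothetical helper: same effect as content.replaceAll("<[^>]+>", ""),
    // but the pattern is compiled once and reused.
    static String stripTags(String html) {
        return TAG.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        // Markup is removed, text is kept.
        System.out.println(stripTags("<p>Hello <b>world</b></p>"));
        // prints: Hello world

        // The script body and the entity are not tags, so they survive.
        System.out.println(stripTags("<script>alert(1)</script>Fish &amp; chips"));
        // prints: alert(1)Fish &amp; chips
    }
}
```

If script bodies or entities matter for your input, that is usually the point where a parser becomes the better tool.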

EDIT: To avoid problems with HTML comments, remove them first. The (?s) flag lets . match line breaks, so comments that span several lines are removed as well:

content = content.replaceAll("(?s)<!--.*?-->", "").replaceAll("<[^>]+>", "");

edited Nov 9, 2009 at 12:40

answered Nov 9, 2009 at 7:29

Andrey Adamovich

2 Comments

Bart Kiers Over a year ago

Since you do not use any of the meta characters ., ^ and $, the s and m flags can be omitted.

Stephen C Over a year ago

This regex is liable to cause mangling if the HTML contains XML comments with embedded '<' or '>' characters.
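The mangling is easy to reproduce when the tag pattern is applied on its own. The sketch below uses made-up names (TagRegexMangling, stripTags) and two hand-written inputs; it shows that a '>' inside a comment or inside a quoted attribute value ends the match too early.

```java
class TagRegexMangling {
    // Hypothetical helper wrapping the single-pattern version of the regex.
    static String stripTags(String html) {
        return html.replaceAll("<[^>]+>", "");
    }

    public static void main(String[] args) {
        // The match stops at the first '>', which sits inside the comment.
        System.out.println("[" + stripTags("<!-- if a > b -->kept") + "]");
        // prints: [ b -->kept]

        // Same problem with a '>' inside a quoted attribute value.
        System.out.println("[" + stripTags("<a title=\"x > y\">link</a>") + "]");
        // prints: [ y">link]
    }
}
```

Stripping comments in a separate pass first avoids the first case, but not the attribute case.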

George G · Accepted Answer · 2016-01-08 09:57:09Z

4

No. Regular expressions cannot, by definition, parse HTML.

You could use a naive substitution such as s/<[^>]*\>//, but it is going to be insufficient, especially if you are also interested in removing the contents of tags.

As another poster said, use an actual HTML parser.
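For readers who cannot add a dependency, the JDK itself ships a lenient HTML parser in javax.swing.text.html. The sketch below is one way to follow the "use a parser" advice with it. The class name SwingHtmlText and the method extract are made up for illustration, and the parser is modelled on HTML 3.2, so treat this as a starting point under those assumptions rather than a robust solution. Whitespace in the result may not match the source exactly.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: collects only the text runs reported by the JDK's HTML parser.
class SwingHtmlText extends HTMLEditorKit.ParserCallback {
    private final StringBuilder text = new StringBuilder();

    @Override
    public void handleText(char[] data, int pos) {
        // Called for each run of character data; tags never reach this method.
        text.append(data);
    }

    static String extract(String html) {
        SwingHtmlText callback = new SwingHtmlText();
        try {
            // 'true' tells the parser to ignore any charset declared in the markup.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            // Not expected for an in-memory reader; rethrown unchecked for brevity.
            throw new UncheckedIOException(e);
        }
        return callback.text.toString();
    }

    public static void main(String[] args) {
        System.out.println(extract("<p>Hello <b>world</b></p>"));
    }
}
```

A dedicated library such as jsoup or HtmlCleaner handles modern markup far better; this only shows that the tag handling is done by a parser rather than by pattern matching.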

edited Jan 8, 2016 at 9:57

George G

answered Nov 9, 2009 at 6:13

Moishe Lettvin

Saeid · Accepted Answer · 2012-06-13 06:09:01Z

2

You don't need an HTML parser. The code below removes all HTML comments:

htmlString = htmlString.replaceAll("(?s)<!--.*?-->", "");
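A small sketch of why the (?s) flag matters for comment removal, assuming the intended pattern is (?s)<!--.*?-->. The class and method names are made up for illustration.

```java
class CommentStripDemo {
    // Without (?s), '.' does not match line terminators.
    static String stripSingleLine(String html) {
        return html.replaceAll("<!--.*?-->", "");
    }

    // With (?s) (DOTALL), '.' also matches '\n', so multi-line comments are matched.
    static String stripDotAll(String html) {
        return html.replaceAll("(?s)<!--.*?-->", "");
    }

    public static void main(String[] args) {
        String html = "before<!-- line one\nline two -->after";
        // The comment spans two lines, so the first variant leaves the input unchanged.
        System.out.println(stripSingleLine(html).equals(html)); // prints: true
        System.out.println(stripDotAll(html));                  // prints: beforeafter
    }
}
```

Note that this removes comments only; tags are left in place.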

answered Jun 13, 2012 at 6:09

Saeid

Kandha · Accepted Answer · 2010-09-03 10:13:08Z

0

You can use this simple code to remove all HTML tags:

htmlString = htmlString.replaceAll("<.*?>", "");

answered Sep 3, 2010 at 10:13

Kandha

3 Comments

jlordo Over a year ago

This will only remove opening tags and leave closing tags unhandled.

jebbie Over a year ago

I never would do a job like that on my own - parsing HTML into plain text is really a tough job, dude.

jmoran Over a year ago

It worked for me, but that may depend on the complexity of the tags, comments, scripts, etc. For a complex case, an HTML library would probably be better.
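A quick check of the points raised in these comments, as a sketch with made-up names (LazyTagRegexDemo, strip). It uses the pattern <.*?>, which in Java behaves the same with or without backslashes before '<' and '>'. On simple input both opening and closing tags are removed; a tag that spans a line break is missed because '.' does not match '\n' by default.

```java
class LazyTagRegexDemo {
    // Hypothetical helper around the lazy pattern from the answer.
    static String strip(String html) {
        return html.replaceAll("<.*?>", "");
    }

    public static void main(String[] args) {
        // Opening and closing tags are both matched on one line.
        System.out.println(strip("<b>bold</b>"));
        // prints: bold

        // The opening tag contains a line break, so only </a> is removed.
        String multiLine = "<a\nhref=\"x\">link</a>";
        System.out.println(strip(multiLine).equals("<a\nhref=\"x\">link"));
        // prints: true
    }
}
```

Adding (?s) to the pattern would cover the line-break case, but the other regex limitations discussed in this thread still apply.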
