Regular expression to remove HTML tags from a string [duplicate]

Question

Possible Duplicate:
Regular expression to remove HTML tags

Is there an expression which will get the value between two HTML tags?

Given this:

<td class="played">0</td>

I am looking for an expression which will return 0, stripping the <td> tags.

Is that the string, the whole string and nothing but the string? If so, how about \d+?
– Ry- ♦, Commented Jun 27, 2012 at 15:31
I'm using something like this: (?:<style.+?>.+?</style>|<script.+?>.+?</script>|<(?:!|/?[a-zA-Z]+).*?/?>) and replacing with "".
– Josh M., Commented Jun 7, 2014 at 2:32
If you're reading this question, please read the accepted answer for the duplicate. The top two answers here are both vulnerable to a very simple input. TL;DR: regular expressions are not useful for properly stripping HTML tags.
– Claudia, Commented Jan 26, 2018 at 17:59
<[^>]*>[^>]*<[^>]*> if you need to remove the context and HTML tag. example: hello<sub>2</sub> guys will be hello guys
– MBK, Commented Jul 27, 2022 at 2:42
This regex <\/?\w[^>]*>|&\w+; requires a proper tag. Example: "3 <5 and 10 > 9" will not be removed and also remove html codes like
– Evandro Jr, Commented Oct 13, 2022 at 12:40

Roddy of the Frozen Peas · Accepted Answer · 2018-01-26 18:25:20Z

233

You should not attempt to parse HTML with regex. HTML is not a regular language, so any regex you come up with will likely fail on some esoteric edge case. Please refer to the seminal answer to this question for specifics. While mostly formatted as a joke, it makes a very good point.

The following examples are Java, but the regex will be similar -- if not identical -- for other languages.

String target = someString.replaceAll("<[^>]*>", "");

This assumes that your non-HTML text does not contain any < or > characters and that your input string is correctly structured.
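As a rough illustration of that assumption, here is a minimal sketch that wraps the same pattern in a helper and runs it on the question's input and on plain text that happens to contain < and >. The class name TagStripper and the method name stripTags are made up for this example.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration; the names are not from the original answer.
final class TagStripper {
    // "<", then any run of characters other than ">", then ">".
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    static String stripTags(String input) {
        return TAG.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        // The input from the question: both tags are removed, the text stays.
        System.out.println(stripTags("<td class=\"played\">0</td>"));

        // Plain text with < and > breaks the assumption: everything between
        // the two signs is treated as a tag and removed.
        System.out.println(stripTags("3 < 5 and 10 > 9"));
    }
}
```

The second call shows why the assumption matters: the comparison text between the two signs is silently lost.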

If you know which tag to expect -- for example, you know the text contains only <td> tags -- you could do something like this, which matches both the opening and the closing tag:

String target = someString.replaceAll("(?i)</?td\\b[^>]*>", "");

Edit: Ωmega brought up a good point in a comment on another post: if the input contains multiple tags, this squishes the text of all of them together.

For example, if the input string were <td>Something</td><td>Another Thing</td>, then the above would result in SomethingAnother Thing.

In a situation where multiple tags are expected, we could do something like:

String target = someString.replaceAll("(?i)</?td\\b[^>]*>", " ").replaceAll("\\s+", " ").trim();

This replaces the HTML with a single space, then collapses whitespace, and then trims any on the ends.
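A small sketch of that replace, collapse, trim sequence, run on the two-cell input from the example. The class and method names (CellText, cellText) are invented for illustration, and the pattern used here is written to match closing </td> tags as well as opening ones, which is an assumption of this sketch.

```java
// Hypothetical helper for illustration; the names are not from the original answer.
final class CellText {
    static String cellText(String html) {
        return html
                .replaceAll("(?i)</?td\\b[^>]*>", " ") // each <td ...> or </td> becomes a space
                .replaceAll("\\s+", " ")               // collapse runs of whitespace into one space
                .trim();                                // drop leading and trailing whitespace
    }

    public static void main(String[] args) {
        // The cell texts stay separated by a single space instead of being merged.
        System.out.println(cellText("<td>Something</td><td>Another Thing</td>"));
    }
}
```

Note that the whitespace collapse also touches whitespace inside the cell text itself, so this only suits cases where the exact spacing does not matter.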

edited Jan 26, 2018 at 18:25

answered Jun 27, 2012 at 15:42

Roddy of the Frozen Peas

11 Comments

Ωmega Over a year ago

The point here is to return match(es). If there are more matches in the string, you will merge them into one messy string. Example: <div>text</div><p>here</p>. Got it?

Ωmega Over a year ago

You should not downvote me for comments. I didn't downvote you. I can prove it by downvoting you now, if you want to...

Roddy of the Frozen Peas Over a year ago

The OP said, "I am looking for an expression which will return 0, stripping the <td> tags." The title of the post is "regular expression to remove html tags from a string". I stripped the <td> tags. Nowhere in the OP did s/he mention anything about pattern matching.

Roddy of the Frozen Peas Over a year ago

@Netsi1964 How does that differ from the solutions already presented in the answer?

Roddy of the Frozen Peas Over a year ago

@Netsi1964 - Actually my examples are Java and are executed on Strings. I have added a note to the answer indicating this.

Joey · Answer · 2012-06-27 15:31:39Z

95

A trivial approach would be to replace

<[^>]*>

with nothing. But depending on how ill-structured your input is that may well fail.
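To make the "may well fail" part concrete, here is a hedged sketch of one input where it goes wrong: an attribute value that contains a > character. The class name IllStructuredInput and the method name stripTags are made up for the example.

```java
// Hypothetical demo for illustration; the names are not from the original answer.
final class IllStructuredInput {
    static String stripTags(String input) {
        return input.replaceAll("<[^>]*>", "");
    }

    public static void main(String[] args) {
        // Simple markup behaves as expected: only the text is left.
        System.out.println(stripTags("<b>a</b><i>b</i>"));

        // The ">" inside the attribute value ends the match early,
        // so the rest of the opening tag leaks into the output.
        System.out.println(stripTags("<a title=\"a > b\">link</a>"));
    }
}
```

The pattern has no notion of quoted attribute values, so the first > it meets is taken as the end of the tag.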

answered Jun 27, 2012 at 15:31

Joey

7 Comments

Ωmega Over a year ago

Replacement is not a good approach. With more than one match, you would merge them into one string.

Joey Over a year ago

I don't think I get what you're trying to say.

Roddy of the Frozen Peas Over a year ago

Given <td>First</td><td>Second</td>, using a replaceAll on the pattern in your post would result in FirstSecond.

Joey Over a year ago

Ah, yes. Indeed. But given <b>a</b><i>b</i> the result ab would be expected. So it's not something you could trivially decide. Besides, viewing XML in a browser causes the same, collapsing all text nodes together.

Niket Pathak Over a year ago

Here's a regex which works well even for malformed html - stackoverflow.com/a/51177854/4717533

mihaisimi · Answer · 2012-06-27 15:34:05Z

9

You could do it with jsoup http://jsoup.org/

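import org.jsoup.Jsoup;
import org.jsoup.safety.Whitelist; // named Safelist in jsoup 1.14.2 and later
Whitelist whitelist = Whitelist.none(); // permits no tags, so only the text is kept
String cleanStr = Jsoup.clean(yourText, whitelist);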

answered Jun 27, 2012 at 15:34

mihaisimi

1 Comment

Roddy of the Frozen Peas Over a year ago

JSoup is a very cool library, but unless the OP is planning on doing a lot more than just the simple replacement he's described in his original post, it's probably a rather heavyweight solution.
