Possible Duplicate:
Regular expression to remove HTML tags

Is there an expression which will get the value between two HTML tags?

Given this:

<td class="played">0</td>

I am looking for an expression which will return 0, stripping the <td> tags.

  • Is that the string, the whole string and nothing but the string? If so, how about \d+? Commented Jun 27, 2012 at 15:31
  • I'm using something like this: (?:<style.+?>.+?</style>|<script.+?>.+?</script>|<(?:!|/?[a-zA-Z]+).*?/?>) and replacing with "". Commented Jun 7, 2014 at 2:32
  • If you're reading this question, please read the accepted answer for the duplicate. The top two answers here are both vulnerable to a very simple input. TL;DR: regular expressions are not useful for properly stripping HTML tags. Commented Jan 26, 2018 at 17:59
  • Use <[^>]*>[^>]*<[^>]*> if you need to remove the HTML tag together with its content. Example: hello<sub>2</sub> guys becomes hello guys. Commented Jul 27, 2022 at 2:42
  • This regex <\/?\w[^>]*>|&\w+; requires a proper tag. Example: "3 <5 and 10 > 9" will not be removed and also remove html codes like &nbsp; Commented Oct 13, 2022 at 12:40

3 Answers

You should not attempt to parse HTML with regex. HTML is not a regular language, so any regex you come up with will likely fail on some esoteric edge case. Please refer to the seminal answer to this question for specifics. While mostly formatted as a joke, it makes a very good point.


The following examples are Java, but the regex will be similar -- if not identical -- for other languages.


String target = someString.replaceAll("<[^>]*>", "");

This assumes that the non-HTML text does not contain any < or > characters and that the input string is well-formed.
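To make that assumption concrete, here is a minimal sketch that wraps the same pattern in a helper and runs it on the question's input and on a string with stray angle brackets. The class name TagStripDemo and the method name stripTags are made up for illustration; the second input is borrowed from one of the comments on the question.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative only.
final class TagStripDemo {
    // Same pattern as in the answer: "<", any run of non-">" characters, then ">".
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    static String stripTags(String html) {
        return TAG.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        // Well-formed input from the question: prints "0".
        System.out.println(stripTags("<td class=\"played\">0</td>"));

        // Plain text containing < and >: the span "<5 and 10 >" is treated
        // as a tag and removed, so this prints "3  9" (with two spaces).
        System.out.println(stripTags("3 <5 and 10 > 9"));
    }
}
```

The second call shows why the caveat matters: the pattern cannot tell a comparison operator from a tag delimiter.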

If you know the tags are of one specific kind, for example the text contains only <td> tags, you could do something like this:

String target = someString.replaceAll("(?i)</?td[^>]*>", "");

Edit: Ωmega brought up a good point in a comment on another post that this would result in multiple results all being squished together if there were multiple tags.

For example, if the input string were <td>Something</td><td>Another Thing</td>, then the above would result in SomethingAnother Thing.

In a situation where multiple tags are expected, we could do something like:

String target = someString.replaceAll("(?i)</?td[^>]*>", " ").replaceAll("\\s+", " ").trim();

This replaces the HTML with a single space, then collapses whitespace, and then trims any on the ends.
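As a rough, self-contained sketch of those three steps, the snippet below applies them to the two-cell example. The class name TdTextDemo and the method name stripTd are hypothetical, and the pattern uses /? so that closing </td> tags are matched as well as opening ones.

```java
// Hypothetical demo class; the names are illustrative only.
final class TdTextDemo {
    static String stripTd(String html) {
        return html
                .replaceAll("(?i)</?td[^>]*>", " ") // each <td ...> or </td> becomes one space
                .replaceAll("\\s+", " ")            // runs of whitespace collapse to one space
                .trim();                            // drop the leading and trailing space
    }

    public static void main(String[] args) {
        // prints "Something Another Thing"
        System.out.println(stripTd("<td>Something</td><td>Another Thing</td>"));
    }
}
```

Note that collapsing whitespace also changes spacing inside the cell text itself, which may or may not be what you want.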

11 Comments

The point here is to return the match(es). If there are more matches in the string, you will merge them into one messy string. Example: <div>text</div><p>here</p>. Got it?
You should not downvote me for comments. I didn't downvote you. I can prove it by downvoting you now, if you want to...
The OP said, "I am looking for an expression which will return 0, stripping the <td> tags." The title of the post is "regular expression to remove html tags from a string". I stripped the <td> tags. Nowhere in the OP did s/he mention anything about pattern matching.
@Netsi1964 How does that differ from the solutions already presented in the answer?
@Netsi1964 - Actually my examples are Java and are executed on Strings. I have added a note to the answer indicating this.

A trivial approach would be to replace

<[^>]*>

with nothing. But depending on how ill-structured your input is that may well fail.
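If the goal is to get the values back as separate matches rather than one stripped string, a capturing group is one option. The following is only a sketch, assuming flat, well-formed <td> cells with no nested tables; the class name TdValueExtractor and the method name cellValues are invented for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; the names are illustrative only.
final class TdValueExtractor {
    // (?i) ignores case, (?s) lets "." span line breaks.
    // Group 1 lazily captures everything up to the next </td>.
    private static final Pattern TD = Pattern.compile("(?is)<td[^>]*>(.*?)</td>");

    static List<String> cellValues(String html) {
        List<String> values = new ArrayList<>();
        Matcher m = TD.matcher(html);
        while (m.find()) {
            values.add(m.group(1).trim());
        }
        return values;
    }

    public static void main(String[] args) {
        // prints [0]
        System.out.println(cellValues("<td class=\"played\">0</td>"));
        // prints [First, Second]
        System.out.println(cellValues("<td>First</td><td>Second</td>"));
    }
}
```

This keeps each cell separate, but it shares the weakness of every regex here: ill-structured input will defeat it.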

7 Comments

Replacement is not a good approach. With multiple matches, you would merge them into one string.
I don't think I get what you're trying to say.
Given <td>First</td><td>Second</td>, using a replaceAll on the pattern in your post would result in FirstSecond.
Ah, yes. Indeed. But given <b>a</b><i>b</i> the result ab would be expected. So it's not something you could trivially decide. Besides, viewing XML in a browser causes the same, collapsing all text nodes together.
Here's a regex which works well even for malformed html - stackoverflow.com/a/51177854/4717533

You could do it with jsoup http://jsoup.org/

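// Whitelist.none() permits no tags, so only the text is kept (newer jsoup releases name this class Safelist).
Whitelist whitelist = Whitelist.none();
String cleanStr = Jsoup.clean(yourText, whitelist);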

1 Comment

JSoup is a very cool library, but unless the OP is planning on doing a lot more than the simple replacement described in the original post, it's probably a rather heavyweight solution.
