1

I have an HTML string and want to replace every link with just its text.

For example, given

Some text <a href="http://google.com/">Google</a>.

I need to get

Some text Google.

What regex should I use?

5
  • Generally speaking (and probably true in this case), you should not use regex to "parse" HTML and work on it; instead, you should use a tool that manipulates your HTML document via the DOM. Commented Mar 13, 2010 at 12:18
  • "How do I parse HTML with a regex" is probably in the top 10 of asked questions on SO. The answer is: you don't. Commented Mar 13, 2010 at 12:29
  • It contains the top-voted answer, that's for sure! - stackoverflow.com/questions/1732348/… Commented Mar 13, 2010 at 12:31
  • The task does look simple at first sight, but there are plenty of potential issues that can come out and bite you. Handling the correct, simple case is quite easy, but experience tells me there will be plenty of incorrect HTML merrily thrown at your code when you're on holiday or on your next project, and you are usually expected to have written code that handles many oddities. Regexes (most likely not a single one but many different ones, together with some procedural code) can do this, but handling the awkward cases is hard, and plenty of people have already worked hard on this elsewhere. Commented Mar 13, 2010 at 12:35
  • Sometimes there is a need for just the basics, where the input format or HTML formatting quality is known. I needed this to strip off some unwanted content before creating a PDF and it worked fine. Commented Nov 25, 2013 at 22:14

3 Answers

2

Several similar questions have been posted, and the best practice is to use the Html Agility Pack, which is built specifically for tasks like this.

http://www.codeplex.com/htmlagilitypack


2 Comments

On a second note, if you really need a regex solution, you can use \<a href=.*?\>(?<text>.*?)\</a\> to extract the text and replace using the same pattern, or simply replace \<a href=.*?\> and \</a\> with an empty string.
+1 this answer. <a href=.*?>... will fail even for simple, valid HTML. Allowing .*? is naïve even by the low, low standards of regex; for example, with a difference as simple as the close-tag being written </a >, you've just matched a big stretch of document across multiple links by mistake. Plus, of course, the hundred other constructs that'll trip this up. Do yourself a favour. Use an HTML parser. It's what they're there for.
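
To make the over-matching described in the comment above concrete, here is a minimal sketch. The class name OverMatchDemo, the helper StripNaive, and the sample HTML are made up for illustration; the pattern is the one suggested earlier, minus the redundant backslashes.

```csharp
using System;
using System.Text.RegularExpressions;

static class OverMatchDemo
{
    // Hypothetical helper: applies the naive pattern and keeps only the captured text.
    public static string StripNaive(string html) =>
        Regex.Replace(html, @"<a href=.*?>(?<text>.*?)</a>", "${text}");

    static void Main()
    {
        // The first closing tag is written "</a >", so the lazy group cannot stop there
        // and runs on until the "</a>" of the second link.
        var html = @"<a href=""x"">One</a > and <a href=""y"">Two</a>";
        Console.WriteLine(StripNaive(html));
        // prints: One</a > and <a href="y">Two
    }
}
```

Both links end up inside a single match, so the second link's opening tag survives in the output.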
1
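// requires: using System.Text.RegularExpressions;
var html = "<a ....>some text</a>";
// Capture the anchor text in a named group; IgnoreCase also matches <A ...>.
var ripper = new Regex("<a.*?>(?<anchortext>.*?)</a>", RegexOptions.IgnoreCase);
// Replace every link with its captured text, keeping any surrounding markup.
html = ripper.Replace(html, "${anchortext}");
// html == "some text"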


1

I asked about a simple regex (thanks, Fabrian). The code is as follows:

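// requires: using System.Text.RegularExpressions;
var html = @"Some text <a href=""http://google.com/"">Google</a>.";
// Remove the opening tags (only those where href is the first attribute)...
Regex r = new Regex(@"\<a href=.*?\>");
html = r.Replace(html, "");
// ...then remove the closing tags.
r = new Regex(@"\</a\>");
html = r.Replace(html, "");
// html == "Some text Google."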
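
If the input is known to be simple, a slightly more tolerant single-pass variant could look like the sketch below. It is not a general HTML solution: it assumes links are not nested and that no attribute value contains a '>' character. The names LinkStripper and StripLinks are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

static class LinkStripper
{
    // Assumption: non-nested links and no '>' inside attribute values.
    // <a\b      opening tag name, so <abbr> and <article> are not matched
    // [^>]*     any attributes, in any order
    // </a\s*>   closing tag, tolerating whitespace as in "</a >"
    static readonly Regex Anchor = new Regex(
        @"<a\b[^>]*>(?<text>.*?)</a\s*>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Replaces each link with its inner content in one pass.
    public static string StripLinks(string html) => Anchor.Replace(html, "${text}");

    static void Main()
    {
        Console.WriteLine(StripLinks(@"Some text <a href=""http://google.com/"">Google</a>."));
        // prints: Some text Google.
        Console.WriteLine(StripLinks(@"<a title=""t"" href=""x"">One</a > and <A HREF=""y"">Two</A>"));
        // prints: One and Two
    }
}
```

Any markup inside a link (for example an img or span element) is kept as-is; for untrusted or messy input, an HTML parser remains the safer choice.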

2 Comments

You're welcome. So I take it this is what you wanted? If so, please accept the answer so that others don't spend time posting more answers.
This doesn't handle the case where the tag has a different attribute (e.g. title) before href. See my answer below.
