I have a string that contains HTML. I want to get all the href values from its hyperlinks using C#.
Target string:
<a href="~/abc/cde" rel="new">Link1</a>
<a href="~/abc/ghq">Link2</a>

I want to get the values "~/abc/cde" and "~/abc/ghq".

  • obligatory reference :) Commented Apr 12, 2010 at 16:54
  • @balpha: What? That absolutely does not apply here. You can use regex to get the href of an open tag and not even bother with closing tags. Commented Apr 12, 2010 at 17:01
  • @Platinum: en.wikipedia.org/wiki/Emoticon Commented Apr 13, 2010 at 5:19
  • @balpha: Well, I'm glad you have a sense of humor, but given how it has also appeared in EVERY answer below, you can understand why I might think people just have this knee-jerk "omg never use regex to parse HTML" response, emoticon or no. Commented Apr 13, 2010 at 18:36
  • @Platinum Azure: No harm -- I just love to mention that answer, because if you've read it once, it will stick in your head and haunt you whenever you start markup parsing with regexes. That doesn't mean it's always wrong, but having that answer in your head makes you at least think about it. I sometimes analyze HTML without a real parser, too, but I usually put a comment # the center cannot hold before it :) Commented Apr 14, 2010 at 15:25

3 Answers

Use the HTML Agility Pack for parsing HTML. Right on their examples page they have an example of parsing some HTML for the href values:

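 // Assumes doc is an HtmlAgilityPack.HtmlDocument that has already been loaded.
 // Current releases expose DocumentNode and link.Attributes; the original sample used older names.
 // Note that SelectNodes returns null when nothing matches.
 foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
 {
    HtmlAttribute att = link.Attributes["href"];

    // Do stuff with att.Value
 }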
Using a regex to parse HTML is not advisable (think of link-like text inside HTML comments, etc.).

That said, the following regex should do the trick. It also captures the link's inner HTML (the linkHtml group) if you need it:

Regex regex = new Regex(@"\<a\s[^\<\>]*?href=(?<quote>['""])(?<href>((?!\k<quote>).)*)\k<quote>[^\>]*\>(?<linkHtml>((?!\</a\s*\>).)*)\</a\s*\>", RegexOptions.IgnoreCase|RegexOptions.ExplicitCapture);
for (Match match = regex.Match(inputHtml); match.Success; match=match.NextMatch()) {
  Console.WriteLine(match.Groups["href"]);
}
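
To make the named groups easier to follow, here is a minimal sketch that runs the same pattern against the two links from the question. The class name HrefRegexDemo and the helper ExtractHrefs are made up for illustration; only the pattern itself comes from the answer.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class HrefRegexDemo
{
    // Same pattern as above. (?<quote>...) remembers which quote character
    // opened the attribute value, and \k<quote> requires that same character
    // to close it. (?<href>...) holds everything in between.
    private static readonly Regex LinkRegex = new Regex(
        @"\<a\s[^\<\>]*?href=(?<quote>['""])(?<href>((?!\k<quote>).)*)\k<quote>[^\>]*\>(?<linkHtml>((?!\</a\s*\>).)*)\</a\s*\>",
        RegexOptions.IgnoreCase | RegexOptions.ExplicitCapture);

    // Hypothetical helper: returns every href value found in the input.
    public static List<string> ExtractHrefs(string inputHtml)
    {
        var hrefs = new List<string>();
        foreach (Match match in LinkRegex.Matches(inputHtml))
        {
            // Groups["href"] is the named capture; .Value is the matched text.
            hrefs.Add(match.Groups["href"].Value);
        }
        return hrefs;
    }

    public static void Main()
    {
        // The two sample links from the question.
        string inputHtml =
            "<a href=\"~/abc/cde\" rel=\"new\">Link1</a>\n" +
            "<a href=\"~/abc/ghq\">Link2</a>";

        foreach (string href in ExtractHrefs(inputHtml))
        {
            Console.WriteLine(href);
        }
    }
}
```

For the sample input this should print ~/abc/cde and ~/abc/ghq on separate lines. Because RegexOptions.ExplicitCapture is set, only the named groups (quote, href, linkHtml) are captured; the unnamed parentheses are used for grouping only.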

3 Comments

That's exactly what I was looking for. How do the groups work?
I am trying the same thing for img src, but it's not working. Any idea? Regex srcs = new Regex(@"\<img\s[^\<\>]*?src=(?<quote>['""])(?<src>((?!\k<quote>).)*)\k<quote>[^\>]*\>(?<linkHtml>((?!\</img\s*\>).)*)\</img\s*\>", RegexOptions.IgnoreCase | RegexOptions.ExplicitCapture);
The img tag is an empty tag, so you have no contents. Try this: \<img\s[^\<\>]*?src=(?<quote>['""])(?<src>((?!\k<quote>).)*)\k<quote>[^\>]*\>
Here is a snippet of the regex (use the RegexOptions.IgnorePatternWhitespace option):

(?:<)(?<Tag>[^\s/>]+)       # Extract the tag name.
(?![/>])                    # Stop if /> is found
# -- Extract Attributes Key Value Pairs  --

((?:\s+)             # One to many spaces start the attribute
 (?<Key>[^=]+)       # Name/key of the attribute
 (?:=)               # Equals sign needs to be matched, but not captured.

(?([\x22\x27])              # If quotes are found
  (?:[\x22\x27])
  (?<Value>[^\x22\x27]+)    # Place the value into named Capture
  (?:[\x22\x27])
 |                          # Else no quotes
   (?<Value>[^\s/>]*)       # Place the value into named Capture
 )
)+                  # -- One to many attributes found!

This will give you every tag and you can filter out what is needed and target the attribute you want.
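
As a rough sketch of that filtering step, the code below compiles the pattern with RegexOptions.IgnorePatternWhitespace and pairs up the Key and Value captures of each match by index. The class AttributeRegexDemo and the helper GetAttributeValues are hypothetical names, and the index pairing assumes each attribute yields exactly one Key capture and one Value capture, which holds for the question's sample but has not been checked against arbitrary HTML.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class AttributeRegexDemo
{
    // The pattern from the answer, with \x22 and \x27 standing for the
    // double and single quote characters.
    private const string TagPattern = @"
(?:<)(?<Tag>[^\s/>]+)       # Extract the tag name.
(?![/>])                    # Stop if /> is found
((?:\s+)                    # One to many spaces start the attribute
 (?<Key>[^=]+)              # Name/key of the attribute
 (?:=)                      # Equals sign, matched but not captured
 (?([\x22\x27])             # If quotes are found
   (?:[\x22\x27])
   (?<Value>[^\x22\x27]+)   # Quoted value
   (?:[\x22\x27])
  |                         # Else no quotes
   (?<Value>[^\s/>]*)       # Unquoted value
 )
)+                          # One to many attributes";

    private static readonly Regex TagRegex =
        new Regex(TagPattern, RegexOptions.IgnorePatternWhitespace);

    // Hypothetical helper: values of one attribute on one kind of tag.
    public static List<string> GetAttributeValues(string html, string tagName, string attributeName)
    {
        var values = new List<string>();
        foreach (Match match in TagRegex.Matches(html))
        {
            if (!string.Equals(match.Groups["Tag"].Value, tagName, StringComparison.OrdinalIgnoreCase))
                continue;

            // A group inside a repeated construct keeps every capture,
            // so the i-th Key is assumed to belong to the i-th Value.
            CaptureCollection keys = match.Groups["Key"].Captures;
            CaptureCollection vals = match.Groups["Value"].Captures;
            for (int i = 0; i < keys.Count && i < vals.Count; i++)
            {
                if (string.Equals(keys[i].Value.Trim(), attributeName, StringComparison.OrdinalIgnoreCase))
                    values.Add(vals[i].Value);
            }
        }
        return values;
    }

    public static void Main()
    {
        string html =
            "<a href=\"~/abc/cde\" rel=\"new\">Link1</a>\n" +
            "<a href=\"~/abc/ghq\">Link2</a>";

        foreach (string href in GetAttributeValues(html, "a", "href"))
        {
            Console.WriteLine(href);
        }
    }
}
```

Calling GetAttributeValues(html, "a", "href") on the question's sample should yield the two href values, while the rel attribute on the first link is skipped by the key comparison.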

I've written more about this on my blog (C# Regex Linq: Extract an Html Node with Attributes of Varying Types).
