Hyperlink regex including http(s):// not working in C#

Question

I think this is sufficiently different from similar questions to warrant a new one.

I have the following regex to match opening hyperlink tags in HTML. It requires the http(s):// prefix so that mailto: links are excluded:

<a[^>]*?href=[""'](?<href>\\b(https?)://[^\[\]""]+?)[""'][^>]*?>

When I run this through Nregex (with escaping removed) it matches correctly for the following test cases:

<a href="http://www.bbc.co.uk">

<a href="http://bbc.co.uk">

<a href="https://www.bbc.co.uk">

<a href="mailto:[email protected]">

However, when I run this in my C# code it fails. Here is the matching code:

public static IEnumerable<string> GetUrls(this string input, string matchPattern)
{
    var matches = Regex.Matches(input, matchPattern, RegexOptions.Compiled | RegexOptions.IgnoreCase);
    foreach (Match match in matches)
    {
        yield return match.Groups["href"].Value;
    }
}

And my tests:

@"<a href=""https://www.bbc.co.uk"">bbc</a>".GetUrls(StringExtensions.HtmlUrlRegexPattern).Count().ShouldEqual(1);

@"<a href=""mailto:[email protected]"">bbc</a>".GetUrls(StringExtensions.HtmlUrlRegexPattern).Count().ShouldEqual(0);

The problem seems to be in the \\b(https?):// part, which I added. Removing it makes the normal URL test pass, but then the mailto: test fails.

Anyone shed any light?

Have we not done the regex can't parse HTML thing to death yet? You have to use an HTML parser, nothing else will ever guarantee your results. Regex parsing the value of the href attribute is another matter though...
– annakata, Commented Mar 12, 2010 at 15:58
@Tim public static string HtmlUrlRegexPattern = @"<a[^>]*?href=[""'](?<href>\\b(https?)://[^\[\]""]+?)[""'][^>]*?>";
– roryf, Commented Mar 12, 2010 at 16:35
OK. You don't need to escape the backslash in an @ string. \b is just fine - what you've written is trying to match a literal backslash and a b.
– Tim Pietzcker, Commented Mar 12, 2010 at 16:50

Alan Moore · Accepted Answer · 2010-03-12 16:49:48Z

1

Are you writing the regex like this?

@"<a[^>]*?href=[""'](?<href>\\b(https?)://[^\[\]""]+?)[""'][^>]*?>"

If so, you have too many backslashes in the word boundary. Because it's a verbatim string literal, the regex compiler sees two backslashes just like you wrote it, so it thinks you're looking for the literal sequence \b.

But you don't need to use a word boundary there anyway. You're already specifying that the protocol must be immediately preceded by a single- or double-quote, so it can't be preceded by a word character.
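To make the escaping point concrete, here is a minimal sketch. The class and member names (`EscapingDemo`, `Matches`, and the three constants) are made up for illustration and do not come from the question. It only shows how the same characters reach the regex engine differently from a verbatim literal and from a regular literal.

```csharp
using System.Text.RegularExpressions;

// Hypothetical demo class; all names here are illustrative assumptions.
public static class EscapingDemo
{
    // Verbatim literal: both backslashes are kept, so the regex engine sees \\b,
    // which means "a literal backslash followed by the letter b".
    public const string DoubledInVerbatim = @"\\bhttps?://";

    // Verbatim literal with one backslash: the engine sees \b, a word boundary.
    public const string SingleInVerbatim = @"\bhttps?://";

    // Regular literal: the C# compiler collapses \\ to a single backslash,
    // so this is the same pattern text as SingleInVerbatim.
    public const string DoubledInRegular = "\\bhttps?://";

    // Returns true if the pattern matches anywhere in the input.
    public static bool Matches(string pattern, string input)
    {
        return Regex.IsMatch(input, pattern);
    }
}
```

With these definitions, `DoubledInVerbatim` should fail against `http://bbc.co.uk` but succeed against text that really contains a backslash and a `b` before `http`, while `SingleInVerbatim` should match the plain URL.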

edited Mar 12, 2010 at 16:49

answered Mar 12, 2010 at 16:44


Gabe · Answer · 2010-03-12 16:55:42Z

1

The problem is that your regex is actually looking to match something like <a href="\bhttps://.... If you remove the \\b (which is unnecessary) it should work. Use this instead:

<a[^>]*?href=[""'](?<href>(https?)://[^\[\]""]+?)[""'][^>]*?>
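As a rough check, the pattern without the \\b can be run against the two test cases from the question. This is only a sketch: the class name `UrlPatternDemo` is hypothetical, and `GetUrls` is written here as a plain static method rather than the question's extension method.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical demo class; not the asker's actual StringExtensions class.
public static class UrlPatternDemo
{
    // The pattern without \\b, written as a verbatim C# string
    // ("" inside a verbatim string is one double-quote character).
    public const string HtmlUrlRegexPattern =
        @"<a[^>]*?href=[""'](?<href>(https?)://[^\[\]""]+?)[""'][^>]*?>";

    // Collects the value of the named group "href" from every match.
    public static List<string> GetUrls(string input)
    {
        return Regex.Matches(input, HtmlUrlRegexPattern, RegexOptions.IgnoreCase)
            .Cast<Match>()
            .Select(m => m.Groups["href"].Value)
            .ToList();
    }
}
```

Calling `UrlPatternDemo.GetUrls(@"<a href=""https://www.bbc.co.uk"">bbc</a>")` should return a single entry, while an anchor whose href starts with `mailto:` should return none, because `https?://` cannot match there.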

answered Mar 12, 2010 at 16:55


LBushkin · Answer · 2010-03-12 16:20:53Z

0

As general advice, when dealing with regular expressions, break them down into their constituent pieces and get each piece working correctly. Then you can focus on assembling them to match your input. Sometimes this is hard to do, particularly with complex expressions involving backtracking or lookahead, but your case is simple enough that you should be able to decompose the expression into parts that work individually.

I think this should work:

@"(?<href>(https?):[/][/][^\[\]""]+?)[""'][^>]*?"

You don't need to escape / symbols in regular expressions, but it doesn't hurt to wrap them in a [ ] character class.
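The piece-by-piece approach described in this answer could be sketched roughly as follows. The split into `TagStart`, `Quote`, `Scheme`, `UrlBody`, and `TagEnd`, and the helper `PieceMatches`, are assumptions made for illustration. They show one possible way to decompose the question's pattern, not the only one.

```csharp
using System.Text.RegularExpressions;

// Hypothetical decomposition of the question's pattern into testable pieces.
public static class PiecewiseDemo
{
    public const string TagStart = @"<a[^>]*?href=";  // start of the tag up to href=
    public const string Quote = @"[""']";             // single or double quote
    public const string Scheme = @"https?://";        // http:// or https://
    public const string UrlBody = @"[^\[\]""]+?";     // rest of the URL, lazily
    public const string TagEnd = @"[^>]*?>";          // remaining attributes and >

    // The assembled pattern, with the URL captured in the named group "href".
    public static readonly string Full =
        TagStart + Quote + "(?<href>" + Scheme + UrlBody + ")" + Quote + TagEnd;

    // Tests one piece in isolation by requiring it to match the whole sample.
    public static bool PieceMatches(string piece, string sample)
    {
        return Regex.IsMatch(sample, "^(?:" + piece + ")$");
    }
}
```

Each piece can then be checked on its own, for example `PieceMatches(Scheme, "https://")`, before `Full` is tried against a complete tag.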

edited Mar 12, 2010 at 16:20

answered Mar 12, 2010 at 15:47


1 Comment

Tim Pietzcker Over a year ago

Your last sentence is not correct: https? will match http or https. What you're referring to would be (https)?.
