0

I think this is sufficiently different from similar questions to warrant a new one.

I have the following regex to match opening hyperlink tags in HTML. It includes the http(s):// part in order to exclude mailto: links:

<a[^>]*?href=[""'](?<href>\\b(https?)://[^\[\]""]+?)[""'][^>]*?>

When I run this through Nregex (with escaping removed) it matches correctly for the following test cases:

<a href="http://www.bbc.co.uk">

<a href="http://bbc.co.uk">

<a href="https://www.bbc.co.uk">

<a href="mailto:[email protected]">

However, when I run this in my C# code it fails. Here is the matching code:

public static IEnumerable<string> GetUrls(this string input, string matchPattern)
{
    var matches = Regex.Matches(input, matchPattern, RegexOptions.Compiled | RegexOptions.IgnoreCase);
    foreach (Match match in matches)
    {
        yield return match.Groups["href"].Value;
    }
}

And my tests:

@"<a href=""https://www.bbc.co.uk"">bbc</a>".GetUrls(StringExtensions.HtmlUrlRegexPattern).Count().ShouldEqual(1);

@"<a href=""mailto:[email protected]"">bbc</a>".GetUrls(StringExtensions.HtmlUrlRegexPattern).Count().ShouldEqual(0);

The problem seems to be in the \\b(https?):// part, which I added. Removing it makes the normal URL test pass but the mailto: test fail.

Can anyone shed any light?

4
  • Have we not done the regex can't parse HTML thing to death yet? You have to use an HTML parser, nothing else will ever guarantee your results. Regex parsing the value of the href attribute is another matter though... Commented Mar 12, 2010 at 15:58
  • Exactly how are you defining matchPattern? Commented Mar 12, 2010 at 16:11
  • @Tim public static string HtmlUrlRegexPattern = @"<a[^>]*?href=[""'](?<href>\\b(https?)://[^\[\]""]+?)[""'][^>]*?>"; Commented Mar 12, 2010 at 16:35
  • OK. You don't need to escape the backslash in an @ string. \b is just fine - what you've written is trying to match a literal backslash and a b. Commented Mar 12, 2010 at 16:50

3 Answers

1

Are you writing the regex like this?

@"<a[^>]*?href=[""'](?<href>\\b(https?)://[^\[\]""]+?)[""'][^>]*?>"

If so, you have too many backslashes in the word boundary. Because it's a verbatim string literal, the regex compiler sees two backslashes just like you wrote it, so it thinks you're looking for the literal sequence \b.

But you don't need to use a word boundary there anyway. You're already specifying that the protocol must be immediately preceded by a single- or double-quote, so it can't be preceded by a word character.
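To make the verbatim-string point concrete, here is a minimal sketch comparing the two spellings. The class name `WordBoundaryDemo` and the shortened patterns are made up for illustration; only the `\\b` versus `\b` difference comes from the question.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WordBoundaryDemo
{
    // In a verbatim string, @"\\b" is the three characters \ \ b,
    // which the regex engine reads as "literal backslash, then the letter b".
    public const string Doubled = @"\\bhttps?://";

    // @"\b" is the two characters \ b, i.e. the word-boundary anchor.
    public const string Single = @"\bhttps?://";

    public static void Main()
    {
        const string url = "http://www.bbc.co.uk";

        Console.WriteLine(Regex.IsMatch(url, Doubled));          // False
        Console.WriteLine(Regex.IsMatch(url, Single));           // True

        // The doubled form only matches when the input really contains "\b".
        Console.WriteLine(Regex.IsMatch(@"\b" + url, Doubled));  // True
    }
}
```

In a regular (non-verbatim) string literal the doubled form would be the right one, since "\\b" there is compiled down to the two characters \b before the regex engine sees it.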

1

The problem is that your regex is actually looking to match something like <a href="\bhttps://.... If you remove the \\b (which is unnecessary) it should work. Use this instead:

<a[^>]*?href=[""'](?<href>(https?)://[^\[\]""]+?)[""'][^>]*?>
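As a rough check, the pattern without the word boundary can be run against the two test inputs from the question. This is a sketch that reuses the question's `GetUrls` extension and `HtmlUrlRegexPattern` field; the `Program` class and the mailto address are placeholders, not taken from the original post.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class StringExtensions
{
    // Pattern from the question with the \\b removed.
    public static readonly string HtmlUrlRegexPattern =
        @"<a[^>]*?href=[""'](?<href>(https?)://[^\[\]""]+?)[""'][^>]*?>";

    public static IEnumerable<string> GetUrls(this string input, string matchPattern)
    {
        var matches = Regex.Matches(input, matchPattern, RegexOptions.Compiled | RegexOptions.IgnoreCase);
        foreach (Match match in matches)
        {
            yield return match.Groups["href"].Value;
        }
    }
}

public static class Program
{
    public static void Main()
    {
        var link = @"<a href=""https://www.bbc.co.uk"">bbc</a>";
        // Placeholder address, used only for illustration.
        var mail = @"<a href=""mailto:someone@example.com"">bbc</a>";

        Console.WriteLine(link.GetUrls(StringExtensions.HtmlUrlRegexPattern).Count()); // 1
        Console.WriteLine(link.GetUrls(StringExtensions.HtmlUrlRegexPattern).First()); // https://www.bbc.co.uk
        Console.WriteLine(mail.GetUrls(StringExtensions.HtmlUrlRegexPattern).Count()); // 0
    }
}
```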

0

As general advice, when dealing with regular expressions, you need to break them down into constituent pieces and get each piece to work correctly. Then you can focus on assembling them to match your input. Sometimes this can be hard to do, particularly with complex expressions involving backtracking or lookahead, but your case is simple enough that you should be able to decompose the expression into parts that work individually.

I think this should work:

@"(?<href>(https?):[/][/][^\[\]""]+?)[""'][^>]*?"

You don't need to escape / symbols in .NET regular expressions, but it doesn't hurt to wrap them in a [ ] character class.
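One way to apply the piece-by-piece advice is to name each fragment and test it alone before testing the assembled pattern. The sketch below assumes a made-up `PatternPieces` class and fragment names; the fragments themselves are taken from the question's pattern.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PatternPieces
{
    // Piece 1: the scheme, which is what rules out mailto: links.
    public const string Scheme = @"https?://";

    // Piece 2: the quoted href attribute, capturing the URL in a named group.
    public const string Href = @"href=[""'](?<href>" + Scheme + @"[^\[\]""]+?)[""']";

    // Piece 3: the whole opening tag, assembled from the pieces above.
    public const string Anchor = @"<a[^>]*?" + Href + @"[^>]*?>";

    public static void Main()
    {
        // Check each piece in isolation before trusting the full pattern.
        Console.WriteLine(Regex.IsMatch("https://www.bbc.co.uk", Scheme));     // True
        Console.WriteLine(Regex.IsMatch("mailto:someone", Scheme));            // False

        // [/][/] and // are interchangeable here.
        Console.WriteLine(Regex.IsMatch("https://www.bbc.co.uk", @"https?:[/][/]")); // True

        var attribute = @"href=""http://bbc.co.uk""";
        Console.WriteLine(Regex.Match(attribute, Href).Groups["href"].Value);  // http://bbc.co.uk

        var tag = @"<a href=""http://bbc.co.uk"">";
        Console.WriteLine(Regex.IsMatch(tag, Anchor));                         // True
    }
}
```

If a piece fails on its own, as a doubled backslash before b would, the failure shows up at that step rather than somewhere inside the full pattern.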

1 Comment

Your last sentence is not correct: https? will match http or https. What you're referring to would be (https)?.
