1

I have this regex that I'm working on:

string addressstart = Regex.Escape("<a href=\"/url?q=");
string addressend = Regex.Escape("&amp");
string regAdd = addressstart + @"(.*?)" + addressend;

I'd like it to give me the URL from this HTML:

<a href="/url?q=https://www.google.com/&amp;sa=U&amp;ved=0ahUKEwizwPy0yNHSAhXMDpAKHec7DAsQFgh6MA0&amp;usg=AFQjCNEjJILXPMMCNAlz5MN1IIzjpr79tw">

So it should return "https://www.google.com/".

Any ideas why it isn't working? Thanks!

1
  • Does my answer below help? Commented Mar 13, 2017 at 21:40

5 Answers

2

The following regex worked for me. Make sure that you select group 1, since group 0 is always the full match.

@"<a href=\"\/url\?q=(.*?)&amp"

1 Comment

Thanks! When I try to use this, though, I get errors because of the quotation marks. Any idea why this could be?
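
The quotation-mark errors are probably a string-literal issue rather than a regex issue: in a C# verbatim string (@"...") a backslash does not escape a double quote, so the quote has to be doubled instead. The sketch below shows the same pattern written both ways and reads capture group 1. The class and method names (QuoteEscapingSketch, ExtractUrl) are made up for illustration and do not come from the answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class QuoteEscapingSketch
{
    // Verbatim string: a double quote is written as "" and backslashes stay literal.
    public const string VerbatimPattern = @"<a href=""/url\?q=(.*?)&amp";

    // Regular string: a double quote is \" and the regex backslash is doubled.
    public const string RegularPattern = "<a href=\"/url\\?q=(.*?)&amp";

    // Returns capture group 1 (the URL), or null when the pattern does not match.
    public static string ExtractUrl(string html, string pattern)
    {
        Match m = Regex.Match(html, pattern);
        return m.Success ? m.Groups[1].Value : null;
    }
}
```

Both constants hold the same pattern text, so either can be passed to ExtractUrl; which literal style to use is a matter of taste.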
1

It appears you are looking for the Google URL inside your string. You might find the following pattern useful, as it will match it:

https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{2,256}

Note that this is a small tweak of the general regex found at: What is a good regular expression to match a URL?

Edit: Please see the code below, which applies this regex to find the value you are looking for:

string input = "<a href=\"/url?q=https://www.google.com/&amp;sa=U&amp;ved=0ahUKEwizwPy0yNHSAhXMDpAKHec7DAsQFgh6MA0&amp;usg=AFQjCNEjJILXPMMCNAlz5MN1IIzjpr79tw\">";
var regex = new Regex(@"https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{2,256}");
var output = regex.Match(input).Value; // https://www.google.com

1 Comment

If you only need the full match, using https?:\/\/www\.[-a-zA-Z0-9@:%._\+~#=]{2,256} and taking group 0 will work too.
1

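The problem is in the "<a href=\"/url?q=" part when it is used as a regular expression as-is: the ? is not escaped, so it means an optional l. That part of the regular expression then matches either <a href="/urlq= or <a href="/urq=, and neither includes the ? character. Note that Regex.Escape, as used in the question, does escape the ?, so this applies only when the prefix is passed as a pattern without escaping.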
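
To make the effect of an unescaped ? concrete, here is a minimal sketch that tries the prefix once as a raw pattern and once after Regex.Escape. The names EscapeSketch, RawPrefixMatches and EscapedPrefixMatches are hypothetical helpers introduced only for this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EscapeSketch
{
    // The literal prefix from the question, before any escaping.
    public const string Prefix = "<a href=\"/url?q=";

    // Uses the prefix as a raw pattern, where "l?" means "an optional l".
    public static bool RawPrefixMatches(string html)
    {
        return Regex.IsMatch(html, Prefix);
    }

    // Escapes the prefix first, so that '?' is matched as a literal character.
    public static bool EscapedPrefixMatches(string html)
    {
        return Regex.IsMatch(html, Regex.Escape(Prefix));
    }
}
```

For the anchor tag in the question, the raw form should find nothing while the escaped form finds the prefix, which is why a pattern built with Regex.Escape does not suffer from this particular problem.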


0

When parsing HTML, you should consider using some HTML parser, like HtmlAgilityPack, and only after getting the necessary node, apply the regex on the plain text.

If you want to debug your own code, here is a fix:

using System;
using System.Text.RegularExpressions;

public class Test
{
    public static void Main()
    {
        var s = "<a href=\"/url?q=https://www.google.com/&amp;sa=U&amp;ved=0ahUKEwizwPy0yNHSAhXMDpAKHec7DAsQFgh6MA0&amp;usg=AFQjCNEjJILXPMMCNAlz5MN1IIzjpr79tw\">";
        var pattern = @"<a href=""/url\?q=(.*?)&amp;";
        var result = Regex.Match(s, pattern);
        if (result.Success)
            Console.WriteLine(result.Groups[1].Value);
    }
}

See a DotNetFiddle demo.

Here is an example of how you may extract all <a> href attribute values that start with /url?q= with HtmlAgilityPack. Install it via Solution > Manage NuGet Packages for Solution... and use

using System;
using System.Collections.Generic;

public List<string> HapGetHrefs(string html)
{
    var hrefs = new List<string>();
    HtmlAgilityPack.HtmlDocument hap;
    Uri uriResult;
    // Treat the argument as a URL only if it parses as an absolute http(s) URI
    if (Uri.TryCreate(html, UriKind.Absolute, out uriResult)
        && (uriResult.Scheme == Uri.UriSchemeHttp || uriResult.Scheme == Uri.UriSchemeHttps))
    { // html is a URL: download and parse the page
        var web = new HtmlAgilityPack.HtmlWeb();
        hap = web.Load(uriResult.AbsoluteUri);
    }
    else
    { // html is a string containing markup
        hap = new HtmlAgilityPack.HtmlDocument();
        hap.LoadHtml(html);
    }
    // SelectNodes returns null when no node matches the XPath
    var nodes = hap.DocumentNode.SelectNodes("//a[starts-with(@href, '/url?q=')]");
    if (nodes != null)
    {
        foreach (var node in nodes)
        {
            foreach (var attribute in node.Attributes)
            {
                if (attribute.Name == "href")
                {
                    hrefs.Add(attribute.Value);
                }
            }
        }
    }
    return hrefs;
}

Then all you need is to apply a simpler regex or a couple of simple string operations.
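
As a rough sketch of the "simple string operations" route, the helper below strips the /url?q= prefix from an href value and cuts the rest at the first &. It assumes the href looks like the one in the question; HrefTargetSketch and ExtractTarget are hypothetical names, not part of HtmlAgilityPack.

```csharp
using System;

public static class HrefTargetSketch
{
    private const string Prefix = "/url?q=";

    // Returns the text between "/url?q=" and the next '&',
    // or null when the href does not start with the prefix.
    public static string ExtractTarget(string href)
    {
        if (href == null || !href.StartsWith(Prefix, StringComparison.Ordinal))
            return null;

        string rest = href.Substring(Prefix.Length);
        int amp = rest.IndexOf('&');
        // Cutting at '&' works whether the separator is "&" or "&amp;".
        return amp >= 0 ? rest.Substring(0, amp) : rest;
    }
}
```

Each string returned by HapGetHrefs could be passed through ExtractTarget. If the target turns out to be percent-encoded, decoding it (for example with Uri.UnescapeDataString) would be an extra step that this sketch leaves out.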


0

You can use:

(?<=a href="\/url\?q=)[^&]+
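
A minimal sketch of how this lookbehind pattern might be used from C#: because the prefix sits inside a lookbehind, the match value itself is the URL and no capture group is needed. LookbehindSketch and ExtractUrl are hypothetical names used only for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LookbehindSketch
{
    // Verbatim string, so the double quote inside the pattern is doubled.
    private static readonly Regex UrlAfterPrefix =
        new Regex(@"(?<=a href=""\/url\?q=)[^&]+");

    // Returns the text that follows the prefix up to the next '&',
    // or null when the prefix is not found.
    public static string ExtractUrl(string html)
    {
        Match m = UrlAfterPrefix.Match(html);
        return m.Success ? m.Value : null;
    }
}
```

The [^&]+ part stops at the first &, so it relies on the target URL itself not containing an unencoded ampersand.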

1 Comment

What are the benefits to this approach over the accepted answer from two years ago?
