C# Regex bug with URL

Question

I'm parsing a file of URLs to get the host and URI part of each one, but there is a bug when the URL does not end with a slash.

C# code:

var URL = Regex.Match(link, @"(?:.*?//)?(.*?)(/.*)", RegexOptions.IgnoreCase);

Input:

//cdn.sstatic.net/stackoverflow/img/favicon.ico
/opensearch.xml
http://stackoverflow.com/
http://careers.stackoverflow.com

Output:

//cdn.sstatic.net/stackoverflow/img/favicon.ico has 2 groups:
    cdn.sstatic.net
    /stackoverflow/img/favicon.ico

/opensearch.xml has 2 groups:

    /opensearch.xml

http://stackoverflow.com/ has 2 groups:
    stackoverflow.com
    /
http://careers.stackoverflow.com has 2 groups:
    http:
    //careers.stackoverflow.com

Every URL in the output is valid except for http://careers.stackoverflow.com. How can I handle a variable part, as in "if there is a slash, stop at the first one, or else grab everything"?

Samuel Neff · Accepted Answer · 2013-10-27 17:20:18Z

1

Add |$ to your last group, so that it matches either the path text or the end of the input.

This works for your inputs:

using System;
using System.Text.RegularExpressions;

var links = new[]
    {
        "//cdn.sstatic.net/stackoverflow/img/favicon.ico",
        "/opensearch.xml",
        "http://stackoverflow.com/",
        "http://careers.stackoverflow.com"
    };

foreach (string link in links)
{
    // Group 1 is the host; group 2 is the path, or empty when the input ends after the host.
    var u = Regex.Match(link, @"(?:.*?//)?(.*?)(/.*|$)", RegexOptions.IgnoreCase);
    Console.WriteLine(link);
    Console.WriteLine("    " + u.Groups[1]);
    Console.WriteLine("    " + u.Groups[2]);
    Console.WriteLine();
}

Output:

//cdn.sstatic.net/stackoverflow/img/favicon.ico
    cdn.sstatic.net
    /stackoverflow/img/favicon.ico

/opensearch.xml

    /opensearch.xml

http://stackoverflow.com/
    stackoverflow.com
    /

http://careers.stackoverflow.com
    careers.stackoverflow.com
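
To see why the extra alternative helps: with a mandatory (/.*), the engine cannot finish the match on http://careers.stackoverflow.com after consuming http://, so it backtracks, skips the optional prefix, and lets the last group start at the first slash it can find, which is the // after http:. The sketch below wraps the pattern from this answer in a small helper. The names UrlSplitter and Split are hypothetical, chosen only for illustration.

```csharp
using System.Text.RegularExpressions;

static class UrlSplitter
{
    // Pattern from the answer above: the last group matches either a
    // path starting with '/' or the end of the input.
    private static readonly Regex Pattern =
        new Regex(@"(?:.*?//)?(.*?)(/.*|$)");

    // Hypothetical helper: returns the two captured groups as (host, path).
    // Either part is an empty string when the input does not contain it.
    public static (string Host, string Path) Split(string link)
    {
        Match m = Pattern.Match(link);
        return (m.Groups[1].Value, m.Groups[2].Value);
    }
}
```

Calling UrlSplitter.Split("http://careers.stackoverflow.com") should give the host careers.stackoverflow.com and an empty path, in line with the output listed above.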


MC ND · Answer · 2013-10-27 17:52:55Z

1

Just another option:

/(?:.+\/\/|\/\/)?([^\/]*)(\/.+)?/
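
The pattern above is written as a slash-delimited literal. As a rough sketch of how it might be used from C#, the version below drops the delimiters and the backslash escapes, since '/' needs no escaping in a verbatim string. One thing to be aware of: the last group, (/.+)?, needs at least one character after the slash, so for http://stackoverflow.com/ it captures nothing instead of "/". The Variant pattern, which uses .* instead, is my own assumption and not part of the answer, as are the names AlternativePattern and Split.

```csharp
using System.Text.RegularExpressions;

static class AlternativePattern
{
    // The answer's pattern as a C# verbatim string.
    public static readonly Regex Original =
        new Regex(@"(?:.+//|//)?([^/]*)(/.+)?");

    // Assumed variant: '.*' instead of '.+' so that a lone
    // trailing "/" is captured as the path.
    public static readonly Regex Variant =
        new Regex(@"(?:.+//|//)?([^/]*)(/.*)?");

    // Hypothetical helper: returns the two captured groups as (host, path).
    // A group that did not take part in the match yields an empty string.
    public static (string Host, string Path) Split(Regex pattern, string link)
    {
        Match m = pattern.Match(link);
        return (m.Groups[1].Value, m.Groups[2].Value);
    }
}
```

With these definitions, Split(AlternativePattern.Original, "http://stackoverflow.com/") gives an empty path, while the same call with AlternativePattern.Variant gives "/".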


acfrancis · Answer · 2013-10-27 17:16:36Z

-1

usr is right that you should use the Uri class, but if you insist on using Regex, try a zero-width positive lookahead assertion like this:

var URL = Regex.Match(link, @"(?:.*?//)?(.*?(?=/|$))(/.*)", RegexOptions.IgnoreCase);

More details at:

http://msdn.microsoft.com/en-us/library/bs2twtah.aspx#zerowidth_positive_lookahead_assertion


2 Comments

Samuel Neff Over a year ago

Uri class won't work. These are not valid Uris. Generates "System.UriFormatException: Invalid URI: The format of the URI could not be determined."

Samuel Neff Over a year ago

The regex doesn't work either. Still produces http: for group 1 and //careers.stackoverflow.com for group 2.
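
The lookahead pattern keeps failing for the reason given in the comment above: its last group, (/.*), is still mandatory, so an input with no slash after the host forces the engine to backtrack to the // following http:. A possible repair, offered here only as a sketch and not taken from any of the answers, is to make the last group optional. The lookahead then does the useful work of forcing the lazy host group to run up to the next slash or the end of the input. The names LookaheadSplitter and Split are hypothetical.

```csharp
using System.Text.RegularExpressions;

static class LookaheadSplitter
{
    // Assumed repair of the lookahead pattern: the trailing path
    // group is optional, so a URL without a path can still match.
    private static readonly Regex Pattern =
        new Regex(@"(?:.*?//)?(.*?(?=/|$))(/.*)?");

    // Hypothetical helper: returns the two captured groups as (host, path).
    public static (string Host, string Path) Split(string link)
    {
        Match m = Pattern.Match(link);
        return (m.Groups[1].Value, m.Groups[2].Value);
    }
}
```

Under this assumption, LookaheadSplitter.Split("http://careers.stackoverflow.com") gives the host careers.stackoverflow.com with an empty path.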
