
I'm parsing a file of URLs to get the host and URI part of each one, but there is a bug when the URL does not end with a slash.

C# code :

var URL = Regex.Match(link, @"(?:.*?//)?(.*?)(/.*)", RegexOptions.IgnoreCase);

Input :

//cdn.sstatic.net/stackoverflow/img/favicon.ico
/opensearch.xml
http://stackoverflow.com/
http://careers.stackoverflow.com

Output :

//cdn.sstatic.net/stackoverflow/img/favicon.ico has 2 groups:
    cdn.sstatic.net
    /stackoverflow/img/favicon.ico

/opensearch.xml has 2 groups:

    /opensearch.xml

http://stackoverflow.com/ has 2 groups:
    stackoverflow.com
    /
http://careers.stackoverflow.com has 2 groups:
    http:
    //careers.stackoverflow.com

Every URL in the output is split correctly except for http://careers.stackoverflow.com. How can I express an optional part, like "if there is a slash, stop at the first one, or else grab everything"?

3 Answers

1

Add |$ to your last group, so that it matches either the path (a slash followed by anything) or the end of the input.

This works for your inputs:

var links = new[]
    {
        "//cdn.sstatic.net/stackoverflow/img/favicon.ico",
        "/opensearch.xml",
        "http://stackoverflow.com/",
        "http://careers.stackoverflow.com"
    };

foreach (string link in links)
{
    var u = Regex.Match(link, @"(?:.*?//)?(.*?)(/.*|$)", RegexOptions.IgnoreCase);
    Console.WriteLine(link);
    Console.WriteLine("    " + u.Groups[1]);
    Console.WriteLine("    " + u.Groups[2]);
    Console.WriteLine();
}

Output:

//cdn.sstatic.net/stackoverflow/img/favicon.ico
    cdn.sstatic.net
    /stackoverflow/img/favicon.ico

/opensearch.xml

    /opensearch.xml

http://stackoverflow.com/
    stackoverflow.com
    /

http://careers.stackoverflow.com
    careers.stackoverflow.com
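
To make the difference concrete, here is a minimal sketch that runs the question's pattern and the pattern above against the problematic input. The class name SplitComparison and the helper Split are hypothetical names used only for illustration. With the original pattern the path group must begin with a slash, so the regex engine gives up on the optional "scheme" prefix and settles on the first slash it can find; with |$ the path group may be empty at the end of the input.

```csharp
using System;
using System.Text.RegularExpressions;

static class SplitComparison
{
    // Pattern from the question: the last group must start with a slash.
    public const string Original = @"(?:.*?//)?(.*?)(/.*)";

    // Pattern from this answer: the last group may also match at end of input.
    public const string Fixed = @"(?:.*?//)?(.*?)(/.*|$)";

    // Hypothetical helper: returns capture groups 1 and 2 as (Host, Path).
    public static (string Host, string Path) Split(string pattern, string link)
    {
        Match m = Regex.Match(link, pattern);
        return (m.Groups[1].Value, m.Groups[2].Value);
    }

    static void Main()
    {
        const string link = "http://careers.stackoverflow.com";

        var before = Split(Original, link);
        var after = Split(Fixed, link);

        Console.WriteLine("original: [" + before.Host + "] [" + before.Path + "]");
        Console.WriteLine("fixed:    [" + after.Host + "] [" + after.Path + "]");
    }
}
```

The sketch leaves out RegexOptions.IgnoreCase because neither pattern contains letters, so the option has no effect on the result.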

1

Just another option

/(?:.+\/\/|\/\/)?([^\/]*)(\/.+)?/
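
The pattern above is written as a JavaScript-style regex literal. As a rough sketch, it could be used from C# by dropping the surrounding delimiters and the escaped slashes, since / needs no escaping in a .NET pattern. The class name AlternativePattern and the helper Split are hypothetical. One behaviour to be aware of: because the last group is /.+ rather than /.*, a lone trailing slash (as in http://stackoverflow.com/) is not captured and group 2 comes back empty.

```csharp
using System;
using System.Text.RegularExpressions;

static class AlternativePattern
{
    // Same pattern as above, without the /.../ delimiters and \/ escapes.
    public const string Pattern = @"(?:.+//|//)?([^/]*)(/.+)?";

    // Hypothetical helper: returns capture groups 1 and 2 as (Host, Path).
    // An optional group that did not participate yields an empty string.
    public static (string Host, string Path) Split(string link)
    {
        Match m = Regex.Match(link, Pattern);
        return (m.Groups[1].Value, m.Groups[2].Value);
    }

    static void Main()
    {
        string[] links =
        {
            "//cdn.sstatic.net/stackoverflow/img/favicon.ico",
            "/opensearch.xml",
            "http://stackoverflow.com/",
            "http://careers.stackoverflow.com"
        };

        foreach (string link in links)
        {
            var parts = Split(link);
            Console.WriteLine(link);
            Console.WriteLine("    [" + parts.Host + "]");
            Console.WriteLine("    [" + parts.Path + "]");
        }
    }
}
```

If the trailing slash matters, changing the last group to (/.*)? should capture it; that is a small departure from the pattern as posted.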


-1

usr is right that you should use the Uri class, but if you insist on using Regex, try a zero-width positive lookahead assertion like this:

var URL = Regex.Match(link, @"(?:.*?//)?(.*?(?=/|$))(/.*)", RegexOptions.IgnoreCase);

More details at:

http://msdn.microsoft.com/en-us/library/bs2twtah.aspx#zerowidth_positive_lookahead_assertion

2 Comments

Uri class won't work. These are not valid Uris. Generates "System.UriFormatException: Invalid URI: The format of the URI could not be determined."
The regex doesn't work either. Still produces http: for group 1 and //careers.stackoverflow.com for group 2.
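
The lookahead pattern fails for the reason given in the comment: its last group (/.*) is still mandatory, so for a URL without a path the engine backtracks to the slashes after http:. One possible repair, which is an assumption of this sketch and not part of the original answer, is to make the last group optional. The class name LookaheadPattern and the helper Split are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

static class LookaheadPattern
{
    // The answer's pattern with the last group made optional: (/.*)?
    // Group 1 stops lazily at the first position followed by '/' or end of input.
    public const string Pattern = @"(?:.*?//)?(.*?(?=/|$))(/.*)?";

    // Hypothetical helper: returns capture groups 1 and 2 as (Host, Path).
    public static (string Host, string Path) Split(string link)
    {
        Match m = Regex.Match(link, Pattern);
        return (m.Groups[1].Value, m.Groups[2].Value);
    }

    static void Main()
    {
        string[] links =
        {
            "//cdn.sstatic.net/stackoverflow/img/favicon.ico",
            "/opensearch.xml",
            "http://stackoverflow.com/",
            "http://careers.stackoverflow.com"
        };

        foreach (string link in links)
        {
            var parts = Split(link);
            Console.WriteLine(link);
            Console.WriteLine("    [" + parts.Host + "]");
            Console.WriteLine("    [" + parts.Path + "]");
        }
    }
}
```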
