i'm looking for a .NET Framework class that can parse URLs.

Some examples of URLs that require parsing:

  • server:8088
  • server:8088/func1
  • server:8088/func1/SubFunc1
  • http://server
  • http://server/func1
  • http://server/func/SubFunc1
  • http://server:8088
  • http://server:8088/func1
  • http://server:8088/func1/SubFunc1
  • magnet://server
  • magnet://server/func1
  • magnet://server/func/SubFunc1
  • magnet://server:8088
  • magnet://server:8088/func1
  • magnet://server:8088/func1/SubFunc1

The problem is that the Uri and UriBuilder classes do not handle the URLs correctly. For example, they are confused by:

stackoverflow.com:8088

Background on URLs

The format of a URL is:

  foo://example.com:8042/over/there?name=ferret#nose
  \_/   \_________/ \__/\_________/\__________/ \__/
   |         |        |     |           |        |
scheme      host    port   path       query   fragment

In our case, we only care about:

  • Uri.Scheme
  • Uri.Host
  • Uri.Port
  • Uri.AbsolutePath (the path; UriBuilder.Path)

Tests

Running some tests, we can check how the UriBuilder class handles various URIs:

//                                       ---------- Expected values ----------
//Test URI                               Scheme    Server    Port  Path
//=====================================  ========  ========  ====  ====================
t("server",                              "",       "server", -1,   "");
t("server/func1",                        "",       "server", -1,   "/func1");
t("server/func1/SubFunc1",               "",       "server", -1,   "/func1/SubFunc1");
t("server:8088",                         "",       "server", 8088, "");
t("server:8088/func1",                   "",       "server", 8088, "/func1");
t("server:8088/func1/SubFunc1",          "",       "server", 8088, "/func1/SubFunc1");
t("http://server",                       "http",   "server", -1,   "");
t("http://server/func1",                 "http",   "server", -1,   "/func1");
t("http://server/func/SubFunc1",         "http",   "server", -1,   "/func/SubFunc1");
t("http://server:8088",                  "http",   "server", 8088, "");
t("http://server:8088/func1",            "http",   "server", 8088, "/func1");
t("http://server:8088/func1/SubFunc1",   "http",   "server", 8088, "/func1/SubFunc1");
t("magnet://server",                     "magnet", "server", -1,   "");
t("magnet://server/func1",               "magnet", "server", -1,   "/func1");
t("magnet://server/func/SubFunc1",       "magnet", "server", -1,   "/func/SubFunc1");
t("magnet://server:8088",                "magnet", "server", 8088, "");
t("magnet://server:8088/func1",          "magnet", "server", 8088, "/func1");
t("magnet://server:8088/func1/SubFunc1", "magnet", "server", 8088, "/func1/SubFunc1");
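
The `t(...)` helper itself is not shown in the question. A minimal sketch of what it might look like, assuming it simply compares the `Scheme`, `Host`, `Port` and `Path` properties of `UriBuilder` against the expected values (the class name `UriBuilderTests` and the method name `T` are hypothetical):

```csharp
using System;

static class UriBuilderTests
{
    // Hypothetical stand-in for the t(...) helper used above: parses the input
    // with UriBuilder and reports whether all four parts match the expectations.
    public static bool T(string url, string scheme, string host, int port, string path)
    {
        var b = new UriBuilder(url);
        bool pass = b.Scheme == scheme && b.Host == host
                 && b.Port == port && b.Path == path;

        // Print the actual values so mismatches are easy to spot.
        Console.WriteLine("{0,-4} {1,-38} scheme='{2}' host='{3}' port={4} path='{5}'",
            pass ? "ok" : "FAIL", url, b.Scheme, b.Host, b.Port, b.Path);
        return pass;
    }

    static void Main()
    {
        T("server:8088",                "",       "server", 8088, "");
        T("http://server:8088/func1",   "http",   "server", 8088, "/func1");
        T("magnet://server:8088/func1", "magnet", "server", 8088, "/func1");
    }
}
```

Note that `new UriBuilder(url)` can throw `UriFormatException` on input it cannot parse at all, so a fuller harness would probably want to catch that.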

All but six cases fail to parse correctly:

Url                                  Scheme  Host    Port  Path
===================================  ======  ======  ====  ===============
server                               http    server  80    /
server/func1                         http    server  80    /func1
server/func1/SubFunc1                http    server  80    /func1/SubFunc1
server:8088                          server          -1    8088
server:8088/func1                    server          -1    8088/func1
server:8088/func1/SubFunc1           server          -1    8088/func1/SubFunc1
http://server                        http    server  80    /
http://server/func1                  http    server  80    /func1
http://server/func/SubFunc1          http    server  80    /func/SubFunc1
http://server:8088                   http    server  8088  /
http://server:8088/func1             http    server  8088  /func1
http://server:8088/func1/SubFunc1    http    server  8088  /func1/SubFunc1
magnet://server                      magnet  server  -1    /
magnet://server/func1                magnet  server  -1    /func1
magnet://server/func/SubFunc1        magnet  server  -1    /func/SubFunc1
magnet://server:8088                 magnet  server  8088  /
magnet://server:8088/func1           magnet  server  8088  /func1
magnet://server:8088/func1/SubFunc1  magnet  server  8088  /func1/SubFunc1

i said i wanted a .NET Framework class, but i would also accept any code-gum lying around that i can pick up and chew, as long as it satisfies my simplistic test cases.

Bonus Chatter

i was looking at expanding this question, but that question is limited to http only.

i also asked this same question earlier today, but i realize now that i phrased it incorrectly. i asked how to "build" a URL, when in reality i want to "parse" a user-entered URL. i can't go back and fundamentally change the title now, so i'll ask the same question again here, only better, with more clearly stated goals.

  • Based on what I see in RFC 3986 the scheme portion of URIs is mandatory. What would your app do when a user didn't enter the scheme? Commented Nov 24, 2013 at 2:29
  • @AndyB It would assume one appropriate for the application (e.g. stratum+udp://). For example, when i type a URL into the address bar (stackoverflow.com:8088) the browser is able to parse it, realize a scheme is missing, assume you meant http, and add it. tl;dr: i want to do what Chrome, Firefox, Safari, Internet Explorer, curl, and wget do. Commented Nov 28, 2013 at 20:37
  • Is there something wrong with my answer? :) Commented Jan 10, 2014 at 14:58
  • @Luaan Sorry, i know you put in the work. But i was hoping to find the .NET class that exists for this purpose. i've been doing this long enough to know that i could never write a regex that handles all the cases i encounter in the wild. As you say, it's not perfect. And a search of SO will find a few dozen different expressions that try to do the same thing; with differing levels of success. Commented Jan 10, 2014 at 15:25
  • Yeah, it's pretty much impossible to parse the URL perfectly. After all, if you have an IPv6 address instead of host name, suddenly colon is a valid character in the hostname itself. And handling all the possibilities quickly spirals out of control. And then you include unicode domains, and escaping, and... I expect that each browser does the heuristics a tiny bit differently too. Commented Jan 10, 2014 at 15:46
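
The idea raised in these comments, assuming a default scheme when the user did not type one and then handing the result to `System.Uri`, could be sketched roughly as follows. The class name `LenientUrl`, the method `TryParse` and the `defaultScheme` parameter are hypothetical, and the check for `"://"` is a deliberately crude heuristic, not a complete scheme detector:

```csharp
using System;

static class LenientUrl
{
    // Hypothetical helper: if the input contains no "://", prepend the
    // application's default scheme, then let System.Uri do the parsing.
    // Assumption: "://" never appears later in the input (e.g. in a query).
    public static bool TryParse(string input, string defaultScheme, out Uri uri)
    {
        string candidate = input.Contains("://")
            ? input
            : defaultScheme + "://" + input;
        return Uri.TryCreate(candidate, UriKind.Absolute, out uri);
    }

    static void Main()
    {
        Uri uri;
        if (TryParse("stackoverflow.com:8088/func1", "http", out uri))
        {
            Console.WriteLine("{0} {1} {2} {3}",
                uri.Scheme, uri.Host, uri.Port, uri.AbsolutePath);
        }
    }
}
```

With this approach the scheme is never reported as empty, and for schemes that .NET knows, `Uri.Port` returns the default port (80 for http) and `Uri.AbsolutePath` returns "/" when nothing was typed, so the caller would still have to map those back to -1 and "" to match the expectations in the test table.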

1 Answer

Will this regular expression do?

^((?<schema>[a-z]*)://)?(?<host>[^/:]*)?(:(?<port>[0-9]*))?(?<path>/.*)?$

It's not perfect, but it seems to work for your test cases.
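
For illustration, here is one way the expression might be applied from C#, reading the named groups and falling back to "" and -1 when a part is absent. The wrapper class `RegexUrlParser` and its `TryParse` signature are hypothetical; only the pattern itself comes from the answer:

```csharp
using System;
using System.Text.RegularExpressions;

static class RegexUrlParser
{
    // The pattern is copied verbatim from the answer above.
    static readonly Regex Pattern = new Regex(
        @"^((?<schema>[a-z]*)://)?(?<host>[^/:]*)?(:(?<port>[0-9]*))?(?<path>/.*)?$");

    // Hypothetical wrapper: missing parts come back as "" (or -1 for the port).
    public static bool TryParse(string url, out string scheme, out string host,
                                out int port, out string path)
    {
        scheme = host = path = "";
        port = -1;

        Match m = Pattern.Match(url);
        if (!m.Success) return false;

        scheme = m.Groups["schema"].Value;
        host = m.Groups["host"].Value;
        path = m.Groups["path"].Value;

        // The port group is optional; reject values that do not fit in an int.
        string portText = m.Groups["port"].Value;
        if (portText.Length > 0 && !int.TryParse(portText, out port))
        {
            port = -1;
            return false;
        }
        return true;
    }

    static void Main()
    {
        string scheme, host, path;
        int port;
        if (TryParse("magnet://server:8088/func1/SubFunc1",
                     out scheme, out host, out port, out path))
        {
            Console.WriteLine("scheme='{0}' host='{1}' port={2} path='{3}'",
                scheme, host, port, path);
        }
    }
}
```

Because the scheme group only accepts lowercase letters, a scheme such as stratum+udp or an uppercase HTTP would not be recognized as written; that is a limitation of the pattern rather than of the wrapper.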
