0

I know there are an infinite number of threads asking this question, but I have not been able to find one that can help me with this.

I am trying to parse a list of around 10,000,000 URLs, check that each one is valid per the following criteria, and then get the root domain. The list contains just about everything you can imagine, including entries like these (with the expected result for each):

bit.ly/test [VALID] [return - bit.ly]
example.com/apples?test=1&id=4 [VALID] [return - example.com]
host101.wow404.apples.test.com/cert/blah [VALID] [return - test.com]
101.121.44.xxx [INVALID] [return false]
localhost/noway [INVALID] [return false]
www.awesome.com [VALID] [return - awesome.com]
i am so awesome [INVALID] [return false]
http://404.mynewsite.com/visits/page/view/1/ [VALID] [return - mynewsite.com]
www1.151.com/searchresults [VALID] [return - 151.com]

Does anyone have any suggestions for this?

5
  • You're not really validating anything with the criteria given. Do you also want to do a WHOIS lookup to see if the domain actually exists? Commented May 3, 2012 at 16:26
  • See [here][1] [1]: stackoverflow.com/questions/206059/php-validation-regex-for-url Commented May 3, 2012 at 16:27
  • 1
    What exactly are you going for? localhost is a valid URL. someverylongdomainnamethatprobablydoesntexist.com also is, but probably doesn't exist. Commented May 3, 2012 at 16:27
  • @yAnTar: Syntax for links in comments is [link text](URL). Commented May 3, 2012 at 16:28
  • "I have not been able to find one that can help me with this." - You have not looked hard enough. Commented May 3, 2012 at 16:48

4 Answers 4

15
^(?:https?://)?(?:[a-z0-9-]+\.)*((?:[a-z0-9-]+\.)[a-z]+)

Explanation

^                # start-of-line
(?:              # begin non-capturing group
  https?         #   "http" or "https"
  ://            #   "://"
)?               # end non-capturing group, make optional
(?:              # start non-capturing group
  [a-z0-9-]+\.   #   a name part (numbers, ASCII letters, dashes) & a dot
)*               # end non-capturing group, match as often as possible
(                # begin group 1 (this will be the domain name)
  (?:            #   start non-capturing group
    [a-z0-9-]+\. #     a name part, same as above
  )              #   end non-capturing group
  [a-z]+         #   the TLD
)                # end group 1 

http://rubular.com/r/g6s9bQpNnC
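
A minimal PHP sketch of how the pattern above might be applied with preg_match. The function name rootDomain, the "~" delimiters, and the "i" flag are assumptions added for illustration; only the pattern itself comes from the answer.

```php
<?php
// Hypothetical helper: returns the captured domain, or false when the pattern does not match.
function rootDomain($url)
{
    // The answer's pattern, wrapped in "~" delimiters so "://" needs no escaping.
    // The "i" flag is an added assumption so that upper-case input also matches.
    $pattern = '~^(?:https?://)?(?:[a-z0-9-]+\.)*((?:[a-z0-9-]+\.)[a-z]+)~i';

    if (preg_match($pattern, trim($url), $matches) === 1) {
        return strtolower($matches[1]);
    }
    return false;
}

// Print the result for a few of the question's samples.
foreach (['bit.ly/test', 'localhost/noway', 'www1.151.com/searchresults'] as $sample) {
    echo $sample, ' => ', var_export(rootDomain($sample), true), PHP_EOL;
}
```

Two limitations are worth keeping in mind: the pattern accepts 101.121.44.xxx and captures 44.xxx, and for a name such as example.co.uk it captures co.uk, so stricter criteria would need additional checks.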


4 Comments

Thank you for this. Love the explanation.
For readers, keep in mind that URLs can contain non-ASCII characters. This regex won't match http://myurl.com/?utf8=✓ (see rubular.com/r/I4fvV3VHVT). Adding the utf8 parameter is a trick for forcing UTF-8 encoding in older browsers (see programmers.stackexchange.com/questions/168751/…).
@DanatheSane You are absolutely right. In fact, something more well thought-out like Daring Fireball: A Liberal, Accurate Regex Pattern for Matching URLs should be used.
Thanks for the link, comprehensive solutions to this problem seem hard to come by.
2

I would start with the default:

filter_var($inputUrl, FILTER_VALIDATE_URL);

Then add checks for the special cases you want to reject. This should simplify things a bit.

As for getting the host:

parse_url($inputUrl, PHP_URL_HOST);
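
A rough sketch of how the two calls might be combined. Both filter_var with FILTER_VALIDATE_URL and parse_url with PHP_URL_HOST expect a scheme, so scheme-less input such as example.com/apples fails as-is; prepending "http://" and rejecting hosts without a dot are assumptions added here, and hostOf is a hypothetical name.

```php
<?php
// Hypothetical helper: returns the lower-cased host, or false if the input is rejected.
function hostOf($input)
{
    $url = trim($input);

    // Assumption: treat scheme-less input as http so the validator can parse it.
    if (!preg_match('~^[a-z][a-z0-9+.-]*://~i', $url)) {
        $url = 'http://' . $url;
    }

    if (filter_var($url, FILTER_VALIDATE_URL) === false) {
        return false;
    }

    $host = parse_url($url, PHP_URL_HOST);

    // Special case from the question: single-label hosts such as "localhost" are rejected.
    if (!is_string($host) || strpos($host, '.') === false) {
        return false;
    }

    return strtolower($host);
}

var_dump(hostOf('example.com/apples?test=1&id=4'));
var_dump(hostOf('localhost/noway'));
```

This yields the full host (for example 404.mynewsite.com), not the root domain, so reducing it to the registrable domain would still need a further step.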

2 Comments

@RohitChopra That is absolutely not true. FILTER_VALIDATE_URL validates against the RFC 2396 specification for valid URLs. faqs.org/rfcs/rfc2396.html
There are also two optional flags you can use with this validator, FILTER_FLAG_PATH_REQUIRED and FILTER_FLAG_QUERY_REQUIRED.
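
As a small illustration of those two flags (the sample URLs and variable names are made up for this sketch), the flag is passed as the third argument to filter_var:

```php
<?php
// FILTER_FLAG_PATH_REQUIRED: the URL must contain a path component.
$noPath   = filter_var('http://example.com', FILTER_VALIDATE_URL, FILTER_FLAG_PATH_REQUIRED);
$withPath = filter_var('http://example.com/apples', FILTER_VALIDATE_URL, FILTER_FLAG_PATH_REQUIRED);

// FILTER_FLAG_QUERY_REQUIRED: the URL must contain a query string.
$noQuery   = filter_var('http://example.com/apples', FILTER_VALIDATE_URL, FILTER_FLAG_QUERY_REQUIRED);
$withQuery = filter_var('http://example.com/apples?test=1', FILTER_VALIDATE_URL, FILTER_FLAG_QUERY_REQUIRED);

// filter_var returns false on failure and the validated string on success.
var_dump($noPath, $withPath, $noQuery, $withQuery);
```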
0

^(([a-zA-Z](\.[a-zA-Z])+)|([0-9]{1,3}(\.[0-9]{1,3}){3}))/.*$

edit

In PHP that would be preg_match('#^(([a-zA-Z](\.[a-zA-Z])+)|([0-9]{1,3}(\.[0-9]{1,3}){3}))/.*$#', $myUrl, $matches), called once per URL string.

What you need would be in $matches[1]

1 Comment

Domain names may contain characters other than Latin letters. This regexp fails even with www1.151.com, mentioned in the question.
0
$website = test_input($_POST["website"]);
if (!preg_match("/\b(?:(?:https?|ftp):\/\/|www\.)[-a-z0-9+&@#\/%?=~_|!:,.;]*[-a-z0-9+&@#\/%=~_|]/i",$website))
  {
  $websiteErr = "Invalid URL";
  }
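
The snippet relies on a test_input() helper that is not shown. A self-contained sketch follows, in which test_input is a hypothetical stand-in doing basic cleanup and looksLikeUrl is a hypothetical wrapper around the same pattern:

```php
<?php
// Hypothetical stand-in for the undefined test_input() helper: basic cleanup only.
function test_input($data)
{
    return htmlspecialchars(stripslashes(trim($data)));
}

// Hypothetical wrapper: true when the answer's pattern finds a URL-like substring.
function looksLikeUrl($raw)
{
    $website = test_input($raw);
    $pattern = '/\b(?:(?:https?|ftp):\/\/|www\.)[-a-z0-9+&@#\/%?=~_|!:,.;]*[-a-z0-9+&@#\/%=~_|]/i';
    return preg_match($pattern, $website) === 1;
}

var_dump(looksLikeUrl('www.awesome.com'));
var_dump(looksLikeUrl('example.com/apples?test=1&id=4'));
```

Because the pattern requires an http, https, or ftp scheme or a leading "www.", scheme-less entries such as example.com/apples?test=1&id=4 are reported as invalid. It is also unanchored, so it matches a URL anywhere inside the string.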
