I had this URL regex pattern in place:

$pattern = "@\b(https?://[^\s()<>\[\]\{\}]{1,".$max_length_allowed_for_each_url."}(?:\([\w\d]+\)|([^[:punct:]\s]|/)))@";

It seemed to work pretty well at validating any URL I threw at it, until I realized that it accepts https://http://google.com as a valid URL, when it certainly is not. (Apparently even Stack Overflow considers it valid: it made that URL clickable, not me, although it did remove one of the colons. So perhaps I am out of luck?)

I did a little research and learned that I should be using filter_var instead of a regex for PHP URL validation anyway, and was disappointed to find that it is susceptible to the very same validation problem.

I could easily conquer it with:

$url = str_replace(array("https://http://","http://https://"), array("http://","https://"), $url);

But... that just seems so wrong.

1 Answer

Well, technically it is a valid URI. Look at the RFC for URIs (RFC 3986) if you don't believe me.

  • The path component of a URI can contain //.
  • http is a valid host name.
  • The port is allowed to be missing even if the : is present (it's specified as *digit, not 1*digit). (This is why Stack Overflow removed the colon -- it thought you were using the default port, so it removed it from the URI.)
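
To see how those three points combine, here is a rough sketch that splits the URI using the generic regular expression from RFC 3986, Appendix B. The function name split_uri is made up for illustration, and the host/port split is a simplification that ignores userinfo and IPv6 literals.

```php
<?php
// Hypothetical helper: split a URI with the generic regex from RFC 3986, Appendix B.
function split_uri(string $uri): array
{
    $re = '~^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?~';
    preg_match($re, $uri, $m);
    return [
        'scheme'    => $m[2] ?? '',
        'authority' => $m[4] ?? '',
        'path'      => $m[5] ?? '',
    ];
}

$parts = split_uri('https://http://google.com');
// scheme    => "https"
// authority => "http:"         (host "http" followed by an empty port)
// path      => "//google.com"  (a path is allowed to contain "//")

// Simplified host/port split on the last colon (assumes no userinfo, no IPv6 literal).
$authority = $parts['authority'];
$colon = strrpos($authority, ':');
$host = $colon === false ? $authority : substr($authority, 0, $colon);
$port = $colon === false ? '' : substr($authority, $colon + 1);
```

So the second "http" is read as the host name, and "google.com" ends up in the path.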

I suggest writing a special case for this. In a separate step, check to see if the URI starts with https?://https?://, and fix it.
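
A minimal sketch of that separate step might look like the following. The name strip_doubled_scheme is hypothetical, and keeping the innermost scheme is an assumption that matches the str_replace in the question.

```php
<?php
// Hypothetical helper: if the URL starts with two or more scheme prefixes,
// drop all but the last one, e.g. "https://http://x" becomes "http://x".
function strip_doubled_scheme(string $url): string
{
    // The lookahead ensures at least one scheme prefix is left in place.
    $fixed = preg_replace('~^(?:https?://)+(?=https?://)~i', '', $url);
    return $fixed ?? $url; // preg_replace returns null only on a regex error
}

$a = strip_doubled_scheme('https://http://google.com');  // "http://google.com"
$b = strip_doubled_scheme('http://https://google.com');  // "https://google.com"
$c = strip_doubled_scheme('https://google.com');         // unchanged
```

Run this before filter_var or the regex, so the validator only ever sees the cleaned-up URL.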
