
I'm developing a PHP-based web application with a form whose textarea inputs can accept links via anchor tags. But when I tested it after adding a hyperlink as follows, the link pointed to a non-existent local subdirectory:
<a href="www.link.com">link</a>
I realized that this was because I had not prepended http:// to the link.

There might be cases where a user inputs a link just as I did above, and in such cases I don't want the link to behave that way. Is there a possible solution, such as automatically prepending http:// to the link when it is missing? How do I do that?
Edit: Please consider that the anchor tags appear amidst other plain text, which makes this harder to work with.

  • If you're only interested in links contained within A tags then this might actually make your life easier, at least as far as detection goes. You can use the DOMDocument extension (which has been part of PHP by default for a while) to grab the A tags and examine their attributes, including href. The normalisation process is still going to be problematic though. Commented Feb 17, 2011 at 8:10
  • @gordon Can you please explain that a bit further? That would be a great help. Thanks. :) Commented Feb 17, 2011 at 15:37
  • I've not made much use of the domdocument extension, but it would involve using uk3.php.net/manual/en/domdocument.getelementsbytagname.php to grab all the A tags. I'm afraid the rest is up to you. Commented Feb 18, 2011 at 7:20
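To make the commenter's suggestion concrete, here is a minimal sketch of using DOMDocument to grab the A tags from a user-submitted fragment and inspect their href attributes. The function name fix_anchor_hrefs is my own invention, not from the thread, and the LIBXML flags assume PHP 5.4+ with a reasonably recent libxml:

```php
<?php
// Sketch: find every <a> tag in a fragment of user-submitted markup and
// prepend http:// to any href that lacks a scheme. Helper name is mine.
function fix_anchor_hrefs(string $html): string
{
    $doc = new DOMDocument();
    // Suppress warnings from malformed user markup; the LIBXML flags keep
    // DOMDocument from wrapping the fragment in <html><body> and a doctype.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
    libxml_clear_errors();

    foreach ($doc->getElementsByTagName('a') as $anchor) {
        $href = $anchor->getAttribute('href');
        // parse_url() reports no scheme for inputs like "www.link.com"
        if ($href !== '' && !parse_url($href, PHP_URL_SCHEME)) {
            $anchor->setAttribute('href', 'http://' . $href);
        }
    }
    return $doc->saveHTML();
}
```

Because DOMDocument only visits actual A elements, the surrounding plain text mentioned in the edit is left untouched.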

2 Answers


I'd go for something like this:

if (!parse_url($url, PHP_URL_SCHEME)) {
    $url = 'http://' . $url;
}

This is an easy and stable way to check for the presence of a scheme in a URL, and it allows other schemes (e.g. ftp, https) that users may enter.



The anchor tag will be amidst other text in the text area. Where and how shall I use the code?

What you're talking about involves two steps: URL detection and URL normalization. First you'll have to detect all the URLs in the string being parsed and store them in a data structure, such as an array, for further processing. Then you need to iterate over the array and normalize each URL in turn before attempting to store them.

Unfortunately, both detection and normalization can be problematic, as a URL has a quite complicated structure. http://www.regexguru.com/2008/11/detecting-urls-in-a-block-of-text/ makes some suggestions, but as the page itself says, no regex URL detection is ever perfect.

There are examples of regular expressions that can detect URLs available from various sites, but in my experience none of them are completely reliable.
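To illustrate the detection step, here is a deliberately simple pattern of my own (not from the linked page) that matches URL-like tokens in plain text. As noted above, it is not reliable: it will both miss valid URLs and over-match in edge cases:

```php
<?php
// Illustrative only: grab http(s) URLs and bare "www." hostnames from text.
// Stops at whitespace, quotes, and angle brackets; pattern is my own sketch.
function detect_urls(string $text): array
{
    $pattern = '~\b(?:https?://|www\.)[^\s<>"\']+~i';
    preg_match_all($pattern, $text, $matches);
    return $matches[0];
}
```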

As for normalization, Wikipedia has an article on the subject which may be a good starting point. http://en.wikipedia.org/wiki/URL_normalization
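As a tiny taste of the normalization step, the sketch below applies just two of the rules from that article (lower-casing the scheme and host) by round-tripping the URL through parse_url(). Real normalization involves many more rules (percent-encoding, default ports, dot-segments), and the helper name is my own:

```php
<?php
// Sketch of two normalisation rules: lowercase the scheme and the host.
function normalize_url(string $url): string
{
    $p = parse_url($url);
    if ($p === false || !isset($p['scheme'], $p['host'])) {
        return $url; // not parseable as an absolute URL; leave untouched
    }
    $out = strtolower($p['scheme']) . '://' . strtolower($p['host']);
    if (isset($p['port']))     { $out .= ':' . $p['port']; }
    if (isset($p['path']))     { $out .= $p['path']; }
    if (isset($p['query']))    { $out .= '?' . $p['query']; }
    if (isset($p['fragment'])) { $out .= '#' . $p['fragment']; }
    return $out;
}
```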


If none of the URL detection and normalization techniques are perfect in themselves, then what am I expected to do?
You have two choices: either be a lot stricter about what you'll recognise as a valid URL (which reduces usability), or do what you're doing and attempt to munge invalid URLs into something that works (which increases the risk of invalid data in your system). Neither approach is ideal, but I would tend to favour the former, as it poses less of a risk regarding the data that might end up in your system.
