18

I'm trying to find a reliable solution to extract a url from a string of characters. I have a site where users answer questions and in the source box, where they enter their source of information, I allow them to enter a url. I want to extract that url and make it a hyperlink. Similar to how Yahoo Answers does it.

Does anyone know a reliable solution that can do this?

All the solutions I have found work for some URL's but not for others.

Thanks

5 Answers 5

23

John Gruber has spent a fair amount of time perfecting the "one regex to rule them all" for link detection. Using preg_replace() as mentioned in the other answers, using the following regex should be one of the most accurate, if not the most accurate, method for detecting a link:

(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))

If you only wanted to match HTTP/HTTPS:

(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))
Sign up to request clarification or add additional context in comments.

4 Comments

For anyone who wants all the sub-patterns converted to be non capturing, and the forward slashes escaped: \b(?:(?:[a-z][\w-]+:(?:\/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}\/)(?:[^\s()<>]+|((?:[^\s()<>]+|(?:([^\s()<>]+)))*))+(?:((?:[^\s()<>]+|(?:([^\s()<>]+)))*)|[^\s`!()[]{};:'".,<>?«»“”‘’]))
TLDs may have much more than 4 characters, see: iana.org/domains/root/db
And how do we use this regex within preg? I mean, because it has " and ' the code doesn't work properly, like: preg_match('(?i)\b......]))', $str) - all code seems like it is commented.
Not working. Preg_match & preg_match_all failing everytime, even after removing single/double quotes
3
$string = preg_replace('/https?:\/\/[^\s"<>]+/', '<a href="$0" target="_blank">$0</a>', $string);

It only matches http/https, but that's really the only protocol you want to turn into a link. If you want others, you can change it like this:

$string = preg_replace('/(https?|ssh|ftp):\/\/[^\s"]+/', '<a href="$0" target="_blank">$0</a>', $string);

3 Comments

You might also want to exclude < or apply htmlspecialchars on the matched string to avoid code injection.
Good, but if you look at the expression, it allows anything but white-space and ". I believe that eliminates any HTML injection.
Bron: No, you are using the matched value not just as attribute value but also as the elements text content.
2

There are a lot of edge cases with urls. Like url could contain brackets or not contain protocol etc. Thats why regex is not enough.

I created a PHP library that could deal with lots of edge cases: Url highlight.

You could extract urls from string or directly highlight them.
Example:

<?php

use VStelmakh\UrlHighlight\UrlHighlight;

$urlHighlight = new UrlHighlight();

// Extract urls
$urlHighlight->getUrls("This is example http://example.com.");
// return: ['http://example.com']

// Make urls as hyperlinks
$urlHighlight->highlightUrls('Hello, http://example.com.');
// return: 'Hello, <a href="http://example.com">http://example.com</a>.'

For more details see readme. For covered url cases see test.

Comments

0

Yahoo! Answers does a fairly good job of link identification when the link is written properly and separate from other text, but it isn't very good at separating trailing punctuation. For example The links are http://example.com/somepage.php, http://example.com/somepage2.php, and http://example.com/somepage3.php. will include commas on the first two and a period on the third.

But if that is acceptable, then patterns like this should do it:

\<http:[^ ]+\>

It looks like stackoverflow's parser is better. Is is open source?

1 Comment

smarter, but still not perfect. misses things like ssh+svn.
-1

This code is worked for me.

function makeLink($string){

/*** make sure there is an http:// on all URLs ***/
$string = preg_replace("/([^\w\/])(www\.[a-z0-9\-]+\.[a-z0-9\-]+)/i", "$1http://$2",$string);
/*** make all URLs links ***/
$string = preg_replace("/([\w]+:\/\/[\w-?&;#~=\.\/\@]+[\w\/])/i","<a target=\"_blank\" href=\"$1\">$1</a>",$string);
/*** make all emails hot links ***/
$string = preg_replace("/([\w-?&;#~=\.\/]+\@(\[?)[a-zA-Z0-9\-\.]+\.([a-zA-Z]{2,3}|[0-9]{1,3})(\]?))/i","<a href=\"mailto:$1\">$1</a>",$string);

return $string;
}

1 Comment

Why are you limiting tld to 3 characters? Have a look at: iana.org/domains/root/db

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.