Link unlinked urls (BBCode) regex

Question

I need a regex that looks for any URL that isn't already inside [url(=...)]...[/url] tags. In other words, I want to link any URL that isn't linked and replace the link with [url]link[/url] so that the parser I'm using can take care of it as it usually would.

I've been trying to get an understanding of negative lookaheads (which is apparently what I should make use of), but I just can't get it down.

This is what I've got so far:

preg_replace('/(?!\[url(=.*?)?\])(https?|ftps?|irc):\/\/(www\.)?(\w+(:\w+)?@)?[a-z0-9-]+(\.[a-z0-9-])*.*(?!\[\/url\])/i',"[url]$0[/url]",$Str);

Thanks

You may also want to verify that the URL is not inside an [img] tag if your BBCode parser allows those.
– ridgerunner, Commented Oct 9, 2011 at 3:26
Actually, I do not parse img tags at all on my site, so it's all good.
– user966939, Commented Oct 9, 2011 at 5:16

user966939 · Accepted Answer · 2015-05-08 23:13:38Z

3

My solution:

<?php
$URLRegex = '/(?:(?<!(\[\/url\]|\[\/url=))(\s|^))';     // No [url]-tag in front and is start of string, or has whitespace in front
$URLRegex.= '(';                                        // Start capturing URL
$URLRegex.= '(https?|ftps?|ircs?):\/\/';                // Protocol
$URLRegex.= '\S+';                                      // Any non-space character
$URLRegex.= ')';                                        // Stop capturing URL
$URLRegex.= '(?:(?<![[:punct:]])|(?<=\/))(\s|\.?$)/i';  // Doesn't end with punctuation (excluding /) and is end of string (with a possible dot at the end), or has whitespace after

$Str = preg_replace($URLRegex,"$2[url]$3[/url]$5",$Str);
?>
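
A quick way to sanity-check the pattern above is to run it over a few sample strings. The sketch below assembles the same regex into one string inside a wrapper function. The function name linkBareUrls and the sample inputs are illustrative assumptions, not part of the original answer.

```php
<?php
// Hypothetical wrapper around the pattern from the answer above.
function linkBareUrls(string $text): string
{
    $pattern = '/(?:(?<!(\[\/url\]|\[\/url=))(\s|^))'      // $2: leading whitespace or start of string
             . '((https?|ftps?|ircs?):\/\/\S+)'            // $3: the URL, $4: its protocol
             . '(?:(?<![[:punct:]])|(?<=\/))(\s|\.?$)/i';  // $5: trailing whitespace or final dot

    // preg_replace() returns null on a regex error; fall back to the input in that case.
    return preg_replace($pattern, '$2[url]$3[/url]$5', $text) ?? $text;
}

echo linkBareUrls('See http://example.com/ for details.'), PHP_EOL;
// See [url]http://example.com/[/url] for details.

echo linkBareUrls('Go to http://example.com.'), PHP_EOL;
// Go to [url]http://example.com[/url].

echo linkBareUrls('[url]http://example.com[/url]'), PHP_EOL;
// [url]http://example.com[/url]   (already tagged, left unchanged)

echo linkBareUrls('[url=http://example.com]Example[/url]'), PHP_EOL;
// [url=http://example.com]Example[/url]   (already tagged, left unchanged)
```

Two behaviors are worth knowing when testing with your own data. The pattern consumes the whitespace after a URL, so when two bare URLs are separated by a single space only the first one is wrapped. And because the lookbehind is evaluated before the leading whitespace, a bare URL that directly follows "[/url] " is left alone.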

edited May 8, 2015 at 23:13

answered Oct 9, 2011 at 3:17

user966939


2 Comments

user966939 Over a year ago

Also allows a dot after the URL if it's at the end of the string (meaning, it will not be a part of the link).

Jeff Widman Over a year ago

Can you edit your answer to also match URLs that end in '/'? Not matching URLs that end in punctuation is great, except that a trailing '/' is almost always part of the URL. I tried modifying [:punct:] using character class subtraction, but unfortunately that's not supported in PCRE.
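
PCRE has no character class subtraction, but "punctuation except /" can usually be expressed in other ways. The sketch below shows two of them: a negative lookahead in front of the class, and a negated class that contains the negated POSIX class. The helper names and sample URLs are hypothetical and only serve the illustration.

```php
<?php
// Hypothetical helper names; both patterns mean "a punctuation character other than /".

// Variant 1: a negative lookahead rejects "/" before [[:punct:]] consumes the character.
function endsWithNonSlashPunct(string $url): bool
{
    return preg_match('/(?!\/)[[:punct:]]\z/', $url) === 1;
}

// Variant 2: a negated class holding the negated POSIX class plus "/".
// "Not (non-punctuation or slash)" is the same set as "punctuation minus slash".
function endsWithNonSlashPunctClass(string $url): bool
{
    return preg_match('/[^[:^punct:]\/]\z/', $url) === 1;
}

foreach (['http://example.com/', 'http://example.com.', 'http://example.com'] as $url) {
    var_dump(endsWithNonSlashPunct($url), endsWithNonSlashPunctClass($url));
}
```

Both helpers should report true only for the second sample, the URL ending in a dot. Variant 2 is a single one-character class, so it should also fit inside a lookbehind, for example (?<![^[:^punct:]\/]).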

user149341 · Accepted Answer · 2011-10-08 06:33:49Z

1

There's an excellent URL-matching regular expression here:

http://daringfireball.net/2009/11/liberal_regex_for_matching_urls

answered Oct 8, 2011 at 6:33

user149341

1 Comment

user966939 Over a year ago

I tried the regex, but it catches [ and ] as well, so this doesn't work for me.

ridgerunner · Accepted Answer · 2011-10-08 14:53:35Z

Linkifying unlinked URLs is not trivial. There are a lot of gotchas (see: The Problem with URLs, and the thread of comments following that blog entry). The problem is compounded when some URLs are already linked and you wish to skip over them. I have looked into this problem and have been working on a solution, an open source project: LinkifyURL. Here is the most recent incarnation of a function which does what you are asking. Note that the regex is NOT trivial (but neither is the problem, as it turns out).

function linkify($text) {
    $url_pattern = '/# Rev:20100913_0900 github.com\/jmrware\/LinkifyURL
    # Match http & ftp URL that is not already linkified.
      # Alternative 1: URL delimited by (parentheses).
      (\()                     # $1: "(" start delimiter.
      ((?:ht|f)tps?:\/\/[a-z0-9\-._~!$&\'()*+,;=:\/?#[\]@%]+)  # $2: URL.
      (\))                     # $3: ")" end delimiter.
    | # Alternative 2: URL delimited by [square brackets].
      (\[)                     # $4: "[" start delimiter.
      ((?:ht|f)tps?:\/\/[a-z0-9\-._~!$&\'()*+,;=:\/?#[\]@%]+)  # $5: URL.
      (\])                     # $6: "]" end delimiter.
    | # Alternative 3: URL delimited by {curly braces}.
      (\{)                     # $7: "{" start delimiter.
      ((?:ht|f)tps?:\/\/[a-z0-9\-._~!$&\'()*+,;=:\/?#[\]@%]+)  # $8: URL.
      (\})                     # $9: "}" end delimiter.
    | # Alternative 4: URL delimited by <angle brackets>.
      (<|&(?:lt|\#60|\#x3c);)  # $10: "<" start delimiter (or HTML entity).
      ((?:ht|f)tps?:\/\/[a-z0-9\-._~!$&\'()*+,;=:\/?#[\]@%]+)  # $11: URL.
      (>|&(?:gt|\#62|\#x3e);)  # $12: ">" end delimiter (or HTML entity).
    | # Alternative 5: URL not delimited by (), [], {} or <>.
      (                        # $13: Prefix proving URL not already linked.
        (?: ^                  # Can be a beginning of line or string, or
        | [^=\s\'"\]]          # a non-"=", non-quote, non-"]", followed by
        ) \s*[\'"]?            # optional whitespace and optional quote;
      | [^=\s]\s+              # or... a non-equals sign followed by whitespace.
      )                        # End $13. Non-prelinkified-proof prefix.
      ( \b                     # $14: Other non-delimited URL.
        (?:ht|f)tps?:\/\/      # Required literal http, https, ftp or ftps prefix.
        [a-z0-9\-._~!$\'()*+,;=:\/?#[\]@%]+ # All URI chars except "&" (normal*).
        (?:                    # Either on a "&" or at the end of URI.
          (?!                  # Allow a "&" char only if not start of an...
            &(?:gt|\#0*62|\#x0*3e);                  # HTML ">" entity, or
          | &(?:amp|apos|quot|\#0*3[49]|\#x0*2[27]); # a [&\'"] entity if
            [.!&\',:?;]?        # followed by optional punctuation then
            (?:[^a-z0-9\-._~!$&\'()*+,;=:\/?#[\]@%]|$)  # a non-URI char or EOS.
          ) &                  # If neg-assertion true, match "&" (special).
          [a-z0-9\-._~!$\'()*+,;=:\/?#[\]@%]* # More non-& URI chars (normal*).
        )*                     # Unroll-the-loop (special normal*)*.
        [a-z0-9\-_~$()*+=\/#[\]@%]  # Last char can\'t be [.!&\',;:?]
      )                        # End $14. Other non-delimited URL.
    /imx';
    $url_replace = '$1$4$7$10$13<a href="$2$5$8$11$14">$2$5$8$11$14</a>$3$6$9$12';
    return preg_replace($url_pattern, $url_replace, $text);
}

This solution does have some limitations, and recently I have been working on an improved version (which is simpler and works better) - but it is not yet ready for prime-time.

Be sure to take a look at the linkify test page where I have put together a list of really-hard-to-match-in-the-wild URLs.

@Jeff Widman - Sorry, but I never finished the newer version - this one here is still the best one I've got for the moment.
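
For the BBCode case in the question, the "skip what is already linked" part can also be handled with a different and much cruder technique than the LinkifyURL pattern: match existing [url] blocks first and hand them back unchanged, and wrap only whatever URL is matched outside them. The sketch below is a minimal illustration under that assumption. The function name wrapBareUrls is hypothetical, and the URL sub-pattern is deliberately simplistic: it stops at whitespace or square brackets and will swallow trailing punctuation.

```php
<?php
// Hypothetical helper: wraps bare URLs in [url]...[/url] and leaves existing [url] blocks alone.
function wrapBareUrls(string $text): string
{
    $pattern = '~
        (\[url(?:=[^\]]*)?\].*?\[/url\])           # 1: an existing [url] or [url=...] block
      | \b(?:https?|ftps?|ircs?)://[^\s\[\]]+      # otherwise: a bare URL up to whitespace or a bracket
    ~isx';

    $result = preg_replace_callback(
        $pattern,
        static function (array $m): string {
            // Group 1 is only set when the first alternative matched: keep that block as is.
            if (isset($m[1]) && $m[1] !== '') {
                return $m[0];
            }
            return '[url]' . $m[0] . '[/url]';
        },
        $text
    );

    // preg_replace_callback() returns null on a regex error; fall back to the input.
    return $result ?? $text;
}

echo wrapBareUrls('[url]http://a.com[/url] and http://b.com'), PHP_EOL;
// [url]http://a.com[/url] and [url]http://b.com[/url]

echo wrapBareUrls('[url=http://a.com]A[/url]'), PHP_EOL;
// [url=http://a.com]A[/url]
```

Because the already-tagged block is consumed by the first alternative, the regex engine never gets a chance to start a match inside it, which avoids the need for lookbehind assertions altogether.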
