
I need a regex that finds any URL that isn't already inside [url(=...)]...[/url] tags. In other words, I want to wrap every URL that isn't yet linked in [url]...[/url] so that the parser I'm using can handle it as it normally would.

I've been trying to understand negative lookaheads (which are apparently what I should use), but I just can't get the hang of them.

This is what I've got so far:

preg_replace('/(?!\[url(=.*?)?\])(https?|ftps?|irc):\/\/(www\.)?(\w+(:\w+)?@)?[a-z0-9-]+(\.[a-z0-9-])*.*(?!\[\/url\])/i',"[url]$0[/url]",$Str);

Thanks

  • You may also want to verify that the URL is not inside an [img] tag if your BBCode parser allows those. Commented Oct 9, 2011 at 3:26
  • Actually, I do not parse img tags at all on my site, so it's all good. Commented Oct 9, 2011 at 5:16

3 Answers


My solution:

<?php
$URLRegex = '/(?:(?<!(\[url\]|\[url=))(\s|^))';         // No opening [url] or [url= tag in front, and is start of string or has whitespace in front
$URLRegex.= '(';                                        // Start capturing URL
$URLRegex.= '(https?|ftps?|ircs?):\/\/';                // Protocol
$URLRegex.= '\S+';                                      // Any non-space character
$URLRegex.= ')';                                        // Stop capturing URL
$URLRegex.= '(?:(?<![[:punct:]])|(?<=\/))(\s|\.?$)/i';  // Doesn't end with punctuation (excluding /) and is end of string (with a possible dot at the end), or has whitespace after

$Str = preg_replace($URLRegex, '$2[url]$3[/url]$5', $Str);
?>

2 Comments

Also allows a dot after the URL if it's at the end of the string (meaning, it will not be a part of the link).
Can you edit your answer to also match URLs that end in '/'? Not matching URLs that end in punctuation is great, except that a trailing '/' is almost always part of the URL. I tried modifying [:punct:] using character class subtraction, but unfortunately that's not supported in PCRE.
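
On the character class subtraction point: PCRE can approximate "[:punct:] minus /" by placing a negative lookahead in front of the class, so each character must be punctuation and must not be a slash. A minimal sketch of that idea follows; the helper name trimTrailingPunctuation is hypothetical and is not part of the answer above.

```php
<?php
// Hypothetical helper: strips trailing punctuation from a URL, but keeps a trailing "/".
// (?!/)[[:punct:]] emulates "[[:punct:]] minus /": the lookahead rejects a slash
// before the POSIX class gets a chance to match it.
function trimTrailingPunctuation(string $url): string
{
    return preg_replace('~(?:(?!/)[[:punct:]])+$~', '', $url);
}
```

With this, 'http://example.com/path/.' becomes 'http://example.com/path/', while a URL without trailing punctuation is returned unchanged.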

There's an excellent URL-matching regular expression here:

http://daringfireball.net/2009/11/liberal_regex_for_matching_urls

1 Comment

I tried the regex, but it catches [ and ] as well, so this doesn't work for me.

Linkifying unlinked URLs is not trivial. There are a lot of gotchas (see "The Problem with URLs" and the thread of comments following that blog entry). The problem is compounded when some URLs are already linked and you wish to skip over them. I have looked into this problem and have been working on a solution, an open source project: LinkifyURL. Here is the most recent incarnation of a function which does what you are asking. As written it emits HTML <a> tags; to produce BBCode instead, change $url_replace to wrap the URL in [url]...[/url]. Note that the regex is NOT trivial (but neither is the problem, as it turns out).

function linkify($text) {
    $url_pattern = '/# Rev:20100913_0900 github.com\/jmrware\/LinkifyURL
    # Match http & ftp URL that is not already linkified.
      # Alternative 1: URL delimited by (parentheses).
      (\()                     # $1: "(" start delimiter.
      ((?:ht|f)tps?:\/\/[a-z0-9\-._~!$&\'()*+,;=:\/?#[\]@%]+)  # $2: URL.
      (\))                     # $3: ")" end delimiter.
    | # Alternative 2: URL delimited by [square brackets].
      (\[)                     # $4: "[" start delimiter.
      ((?:ht|f)tps?:\/\/[a-z0-9\-._~!$&\'()*+,;=:\/?#[\]@%]+)  # $5: URL.
      (\])                     # $6: "]" end delimiter.
    | # Alternative 3: URL delimited by {curly braces}.
      (\{)                     # $7: "{" start delimiter.
      ((?:ht|f)tps?:\/\/[a-z0-9\-._~!$&\'()*+,;=:\/?#[\]@%]+)  # $8: URL.
      (\})                     # $9: "}" end delimiter.
    | # Alternative 4: URL delimited by <angle brackets>.
      (<|&(?:lt|\#60|\#x3c);)  # $10: "<" start delimiter (or HTML entity).
      ((?:ht|f)tps?:\/\/[a-z0-9\-._~!$&\'()*+,;=:\/?#[\]@%]+)  # $11: URL.
      (>|&(?:gt|\#62|\#x3e);)  # $12: ">" end delimiter (or HTML entity).
    | # Alternative 5: URL not delimited by (), [], {} or <>.
      (                        # $13: Prefix proving URL not already linked.
        (?: ^                  # Can be a beginning of line or string, or
        | [^=\s\'"\]]          # a non-"=", non-quote, non-"]", followed by
        ) \s*[\'"]?            # optional whitespace and optional quote;
      | [^=\s]\s+              # or... a non-equals sign followed by whitespace.
      )                        # End $13. Non-prelinkified-proof prefix.
      ( \b                     # $14: Other non-delimited URL.
        (?:ht|f)tps?:\/\/      # Required literal http, https, ftp or ftps prefix.
        [a-z0-9\-._~!$\'()*+,;=:\/?#[\]@%]+ # All URI chars except "&" (normal*).
        (?:                    # Either on a "&" or at the end of URI.
          (?!                  # Allow a "&" char only if not start of an...
            &(?:gt|\#0*62|\#x0*3e);                  # HTML ">" entity, or
          | &(?:amp|apos|quot|\#0*3[49]|\#x0*2[27]); # a [&\'"] entity if
            [.!&\',:?;]?        # followed by optional punctuation then
            (?:[^a-z0-9\-._~!$&\'()*+,;=:\/?#[\]@%]|$)  # a non-URI char or EOS.
          ) &                  # If neg-assertion true, match "&" (special).
          [a-z0-9\-._~!$\'()*+,;=:\/?#[\]@%]* # More non-& URI chars (normal*).
        )*                     # Unroll-the-loop (special normal*)*.
        [a-z0-9\-_~$()*+=\/#[\]@%]  # Last char can\'t be [.!&\',;:?]
      )                        # End $14. Other non-delimited URL.
    /imx';
    $url_replace = '$1$4$7$10$13<a href="$2$5$8$11$14">$2$5$8$11$14</a>$3$6$9$12';
    return preg_replace($url_pattern, $url_replace, $text);
}

This solution does have some limitations, and recently I have been working on an improved version (which is simpler and works better), but it is not yet ready for prime time.

Be sure to take a look at the linkify test page where I have put together a list of really-hard-to-match-in-the-wild URLs.
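
For the BBCode case in the question, one way to sidestep the already-linked problem is to let the regex consume existing [url]...[/url] spans first and throw them away with PCRE's (*SKIP)(*FAIL) backtracking verbs, so that only bare URLs are left to wrap. The following is a minimal sketch under that assumption. It is not part of LinkifyURL, the function name wrapBareUrls is hypothetical, and the URL character set is deliberately simplified (anything up to whitespace or a square bracket).

```php
<?php
// Hypothetical helper: wraps bare URLs in [url]...[/url] and leaves tagged ones alone.
function wrapBareUrls(string $text): string
{
    $pattern = '~
        \[url(?:=[^\]]*)?\].*?\[/url\](*SKIP)(*FAIL)  # existing tag pair: consume it, then discard the match
      |
        \b(?:https?|ftps?|ircs?)://[^\s\[\]]+         # bare URL, up to whitespace or a square bracket
        (?<![.,;:!?])                                 # back off so the URL does not end on sentence punctuation
    ~isx';

    // $0 is the whole match, which here can only be a bare URL.
    return preg_replace($pattern, '[url]$0[/url]', $text);
}
```

Given 'See http://example.com/a. Already: [url]http://example.org[/url]', only the first URL is wrapped, and the trailing dot stays outside the tag. Running the function again on its own output changes nothing, since every URL is then inside a tag pair.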

2 Comments

Curious if you ever got your improved version working?
@Jeff Widman - Sorry, but I never finished the newer version - this one here is still the best one I've got for the moment.
