Using regex to get urls with some rules

Question

A string contains many URLs. How can I get only the URLs that do not start with [url] and do not end with [/url]?

Example:

A string contains many URLs: https://stackoverflow.com/1 [url]https://stackoverflow.com/2[/url] https://stackoverflow.com/3 [url]https://stackoverflow.com/4[/url]. How can I match only the untagged URLs?

In this sample, the match should return only https://stackoverflow.com/1 and https://stackoverflow.com/3.

url3 ends with [/url]. Is your question only about this example, or is it about the general case?
– Meninx - メネンックス, Commented Jun 24, 2016 at 13:13
@KalaMei Do all your URLs start with http? If so, see my answer below.
– Mark He, Commented Jun 24, 2016 at 14:54

Meninx - メネンックス · Accepted Answer · 2016-06-25 23:26:18Z

1

I will focus only on the regular expression, since it is the key to extracting the URLs. It is:

 (?!\[url\])\s+\bhttp:\/\/stackoverflow.com\/\d\s+(?<!\[\/url\])

You can see the result at this URL, which runs it through the PHP function preg_match_all.

But first, let's go through every part of it (the same site provides this breakdown):

(?!\[url\])\s+\bhttp:\/\/stackoverflow.com\/\d\s+(?<!\[\/url\])

(?!\[url\]) Negative Lookahead - Assert that it is impossible to match the regex below
- \[ matches the character [ literally
- url matches the characters url literally (case sensitive)
- \] matches the character ] literally

\s+ match any white space character [\r\n\t\f ] Quantifier: + Between one and unlimited times, as many times as possible, giving back as needed [greedy]
\b assert position at a word boundary (^\w|\w$|\W\w|\w\W)
http: matches the characters http: literally (case sensitive)
\/ matches the character / literally
\/ matches the character / literally
stackoverflow matches the characters stackoverflow literally (case sensitive)
. matches any character (except newline); write \. to match only a literal dot
com matches the characters com literally (case sensitive)
\/ matches the character / literally
\d match a digit [0-9]
\s+ match any white space character [\r\n\t\f ] Quantifier: + Between one and unlimited times, as many times as possible, giving back as needed [greedy]
(?<!\[\/url\]) Negative Lookbehind - Assert that it is impossible to match the regex below
- \[ matches the character [ literally
- \/ matches the character / literally
- url matches the characters url literally (case sensitive)
- \] matches the character ] literally

Finally, call the PHP function as follows (note the / delimiters around the pattern, which preg_match_all requires):

preg_match_all('/(?!\[url\])\s+\bhttp:\/\/stackoverflow.com\/\d\s+(?<!\[\/url\])/', $input_lines, $output_array);

$input_lines is the variable that holds your string

$output_array is the array that receives the matched URLs
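
To see what this pattern actually captures, here is a minimal JavaScript sketch. The answer itself uses PHP; String.prototype.match with the g flag stands in for preg_match_all here. The sample string is an assumption: it is the question's example rewritten with http:// links, because the pattern hard-codes http: and would not match the https:// links as posted. The variable names are illustrative, and lookbehind needs a reasonably modern JavaScript engine.

```javascript
// Pattern copied from the answer; only the g flag is added to collect every match.
const meninxPattern = /(?!\[url\])\s+\bhttp:\/\/stackoverflow.com\/\d\s+(?<!\[\/url\])/g;

// Assumed sample: the question's string with http:// instead of https://.
const meninxSample =
  "A string contain many urls: http://stackoverflow.com/1 " +
  "[url]http://stackoverflow.com/2[/url] http://stackoverflow.com/3 " +
  "[url]http://stackoverflow.com/4[/url], how to get match urls?";

// The mandatory \s+ on both sides is what excludes the tagged URLs,
// so each raw match carries surrounding whitespace that has to be trimmed.
const meninxMatches = (meninxSample.match(meninxPattern) || []).map((m) => m.trim());

console.log(meninxMatches);
// → [ 'http://stackoverflow.com/1', 'http://stackoverflow.com/3' ]
```

Because the lookahead sits directly in front of \s+ and the lookbehind directly after \s+, neither can fail at those positions; the whitespace requirement does the filtering. As a consequence, a bare URL at the very start or end of the string, or a second bare URL separated from the first by a single space, would be missed.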

edited Jun 25, 2016 at 23:26

answered Jun 24, 2016 at 13:50

Meninx - メネンックス


2 Comments

Kala Mei Over a year ago

Thanks for your reply, but this regex rule will return all urls (1, 2, 3, 4), you can see the test at regex101.com/r/tB5lY0/2 or sandbox.onlinephpfunctions.com/code/…

Meninx - メネンックス Over a year ago

You need to use the negative lookbehind and lookahead! I will edit my answer.

Mark He · Accepted Answer · 2016-06-24 15:23:49Z

0

(?<!\[url\])(?![^\s]+\[\/url\])http[^\s]*

This will grab all the URLs not enclosed in the tags you mentioned ([url] and [/url]). Note that this works for any URL starting with http, not just the one you listed (i.e. http://stackoverflow.com), which I think is what you want. You can see the explanation of each rule and a live demo on Regex101: https://regex101.com/r/wN9aX0/3
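
As a quick check of this pattern against the question's sample, here is a small JavaScript sketch. The variable names are illustrative, the g flag is added so that all matches are returned, and lookbehind requires a reasonably modern engine. The extra input at the end is the case raised in the comments on this answer.

```javascript
// Pattern from the answer, with the g flag added to return every match.
const markPattern = /(?<!\[url\])(?![^\s]+\[\/url\])http[^\s]*/g;

// The sample string from the question.
const markSample =
  "https://stackoverflow.com/1 [url]https://stackoverflow.com/2[/url] " +
  "https://stackoverflow.com/3 [url]https://stackoverflow.com/4[/url]";

// (?<!\[url\]) rejects a URL directly preceded by [url];
// (?![^\s]+\[\/url\]) rejects one whose non-whitespace run contains [/url].
const markMatches = markSample.match(markPattern) || [];
console.log(markMatches);
// → [ 'https://stackoverflow.com/1', 'https://stackoverflow.com/3' ]

// A URL that has only the closing tag is skipped as well.
const markClosingOnly = "http://stackoverflow.com/5[/url]".match(markPattern);
console.log(markClosingOnly);
// → null
```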

edited Jun 24, 2016 at 15:23

answered Jun 24, 2016 at 14:46

Mark He


2 Comments

Maria Ivanova Over a year ago

This would also match http://stackoverflow.com/5[/url], i.e. if it doesn't start with [url] but ends with [/url] - not sure if this is OK with the OP.

Mark He Over a year ago

@MariaDeleva Oops, that was my bad, my negative lookahead was incorrect, fixed now.

Maria Ivanova · Accepted Answer · 2016-06-24 14:09:59Z

0

This pattern is a little complicated and probably won't work in all cases, but it will work in most. If it fails in a case you need, I can tweak it further:

(?<!(\[url\]))[\s.:]((http|https)(:\/\/))?([[:alnum:]\-_]*)(([\.])([[:alnum:]\-_]*)){1,}([\/]([[:alnum:]\-_]*))*[.:;\s]((?!\[\/url\]))

answered Jun 24, 2016 at 14:09

Maria Ivanova


Ehsan · Accepted Answer · 2016-06-24 14:12:10Z

0

This should help:

var patt =/(?:\bhttp:\/\/stackoverflow.com\/\d{1,})(?!\[\/url\])/;

Example:

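<html>
<head></head>
    <body>
        <script>
            // stackoverflow.com URL ending in digits, with a negative lookahead for [/url]
            var patt = /(?:\bhttp:\/\/stackoverflow.com\/\d{1,})(?!\[\/url\])/;
            var str = "http://stackoverflow.com/2";
            // test() returns true when the pattern is found anywhere in str
            if (patt.test(str))
                alert("Valid");
            else
                alert("Invalid");
        </script>
    </body>
</html>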
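
The example above tests a single URL. As a rough sketch of how the same pattern behaves on a whole string, the snippet below adds the g flag and runs it over the question's sample rewritten with http:// links (an assumption, since the pattern hard-codes http:). The multi-digit input at the end is a made-up case that illustrates a backtracking pitfall.

```javascript
// Pattern from the answer, with the g flag added to collect all matches.
const ehsanPattern = /(?:\bhttp:\/\/stackoverflow.com\/\d{1,})(?!\[\/url\])/g;

// Assumed sample: the question's string with http:// links.
const ehsanSample =
  "http://stackoverflow.com/1 [url]http://stackoverflow.com/2[/url] " +
  "http://stackoverflow.com/3 [url]http://stackoverflow.com/4[/url]";

const ehsanMatches = ehsanSample.match(ehsanPattern);
console.log(ehsanMatches);
// → [ 'http://stackoverflow.com/1', 'http://stackoverflow.com/3' ]

// Hypothetical multi-digit ID: \d{1,} backtracks to a single digit, after which
// the lookahead sees "2[/url]" rather than "[/url]", so a truncated URL matches.
const ehsanTruncated = "[url]http://stackoverflow.com/42[/url]".match(ehsanPattern);
console.log(ehsanTruncated);
// → [ 'http://stackoverflow.com/4' ]
```

One way to guard against the truncated match would be to let the lookahead skip any remaining digits first, for example (?!\d*\[\/url\]).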

answered Jun 24, 2016 at 14:12

Ehsan
