PHP regex to clean a specific string from URLs only

Question

Can any regex ninjas out there come up with a PHP solution that strips the <cite> tag from any http URL, but leaves the tag in the rest of the text?

e.g.:

the word <cite>printing</cite> is in http://www.thisis<cite>printing</cite>.com

should become:

the word <cite>printing</cite> is in http://www.thisisprinting.com

Matching URLs isn't easy; human error, like not putting a space after the period that ends a sentence, can play havoc. Which parts of the URL can you guarantee to be there, i.e. https?:// or www.? If you can guarantee that some string will exist, then removing the tags wouldn't be hard.
– gwillie, Commented Oct 25, 2013 at 0:20

Jonathan Kuhn · Accepted Answer · 2013-10-24 21:56:18Z

1

This is what I would do:

<?php
// Callback wrapper around strip_tags(): receives the match array and
// returns the matched text with all HTML tags removed
function strip($matches){
    return strip_tags($matches[0]);
}

// The string
$str = "the word <cite>printing</cite> is in http://www.thisis<cite>printing</cite>.com";
// Match everything from "://" up to the next whitespace and call the strip callback on it
$str = preg_replace_callback("/:\/\/[^\s]*/", 'strip', $str);

// Prove that it works
var_dump(htmlentities($str));

http://codepad.viper-7.com/XiPcs9
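
A possible variant of the same callback idea, as a rough sketch only: it assumes every URL starts with http://, https:// or www. (the prefixes raised in the comment under the question) and runs up to the next whitespace, and it removes only <cite> tags instead of every tag. The function name strip_cite_in_urls is made up for illustration.

```php
<?php
// Hypothetical helper (not from the answer): remove only <cite> tags
// that fall inside URL-like runs of text.
function strip_cite_in_urls(string $text): string
{
    // Assumption: a URL begins with http://, https:// or www. and
    // continues until the next whitespace character.
    $result = preg_replace_callback(
        '~(?:https?://|www\.)\S+~i',
        function (array $m): string {
            // Drop every opening and closing cite tag inside the matched URL.
            return preg_replace('~</?cite>~i', '', $m[0]);
        },
        $text
    );

    // preg_replace_callback() returns null on a regex error; fall back to the input.
    return $result ?? $text;
}

$input = 'the word <cite>printing</cite> is in http://www.thisis<cite>printing</cite>.com';
echo strip_cite_in_urls($input), PHP_EOL;
```

Because the callback sees the whole URL at once, any number of tags inside one URL is handled in a single pass. Tagged text that itself contains whitespace would end the match early, so this is only suitable when the tagged fragments are single words.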

answered Oct 24, 2013 at 21:56

Jonathan Kuhn


revo · Accepted Answer · 2013-10-24 22:09:02Z

1

An appropriate regex for this substitution could be:

#(https?://)(.*?)<cite>(.*?)</cite>([^\s]*)#s

The s flag lets . match newlines as well.
The lazy quantifiers between the tags keep the match from running past the nearest closing tag and swallowing other, similar tags.

Snippet:

<?php
$str = "the word <cite>printing</cite> is in http://www.thisis<cite>printing</cite>.com";
// Single quotes make it explicit that $1..$4 are backreferences, not PHP variables
$replaced = preg_replace('#(https?://)(.*?)<cite>(.*?)</cite>([^\s]*)#s', '$1$2$3$4', $str);
echo $replaced;

// Output: the word <cite>printing</cite> is in http://www.thisisprinting.com


edited Oct 24, 2013 at 22:09

answered Oct 24, 2013 at 22:02

revo


2 Comments

Jonathan Kuhn Over a year ago

Instead of using (.*?) I would suggest using ([^\s]*?). With the way you are doing it, if the url was the first part of the string without cite tags, subsequent <cite> tags would be removed from later text.
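
The difference can be seen in a small sketch. The input string and the variable names are invented for illustration, and example.com is only a placeholder URL; the two patterns are the one from the answer and the one with the suggested ([^\s]*?) groups.

```php
<?php
// Hypothetical input: the URL itself has no cite tags, but later text does.
$text = 'http://example.com is about <cite>printing</cite> today';

// Pattern from the answer: (.*?) may cross whitespace, so it reaches later text.
$loose = '#(https?://)(.*?)<cite>(.*?)</cite>([^\s]*)#s';

// Suggested change: ([^\s]*?) cannot cross whitespace, so the match stays inside the URL.
$strict = '#(https?://)([^\s]*?)<cite>([^\s]*?)</cite>([^\s]*)#';

$looseResult  = preg_replace($loose, '$1$2$3$4', $text);
$strictResult = preg_replace($strict, '$1$2$3$4', $text);

echo $looseResult, PHP_EOL;   // the cite tags in the sentence are gone
echo $strictResult, PHP_EOL;  // the text is left unchanged
```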

user884899 Over a year ago

That's a great start - @JonathanKuhn's addition is priceless. How would I trap a second (third etc) instance of the string in the URL? eg: https://<cite>app</cite>leid.<cite>app</cite>le.com
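
One way this might be approached, sketched under the assumption that a URL never contains whitespace: each pass of the pattern removes at most one tag pair per URL, so the replacement is repeated until preg_replace reports no more substitutions. The simplified two-group pattern and the sample sentence are illustrative, not taken from the answer.

```php
<?php
// Assumed pattern: a URL prefix plus any non-whitespace run, then one cite pair.
$pattern = '#(https?://[^\s]*?)<cite>([^\s]*?)</cite>#';

// Hypothetical input with two tagged fragments in the URL and one in plain text.
$text = 'sign in at https://<cite>app</cite>leid.<cite>app</cite>le.com for <cite>app</cite> news';

do {
    // $count receives the number of substitutions made in this pass.
    $text = preg_replace($pattern, '$1$2', $text, -1, $count);
} while ($count > 0);

echo $text, PHP_EOL;
```

The loop stops once no cite pair remains inside any URL; tags elsewhere in the text are separated from the URL by whitespace and so are never matched.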

anubhava · Accepted Answer · 2013-10-25 07:16:39Z

0

Assuming you can already isolate the URLs from your text, you can strip the tags from each one directly:

$str = 'http://www.thisis<cite>printing</cite>.com';
$str = preg_replace('~</?cite>~i', "", $str);
echo $str;

OUTPUT:

http://www.thisisprinting.com

edited Oct 25, 2013 at 7:16

answered Oct 24, 2013 at 21:48

anubhava


2 Comments

anubhava Over a year ago

@HamZa: Assuming $str = 'http://www.thisis<cite>printing</cite>.com'; not the full HTML text.

revo Over a year ago

You omitted printing here, but he wants the output to be http://www.thisisprinting.com
