Are there any regex ninjas out there who can come up with a PHP solution for stripping the <cite> tag from any http URL while leaving the tag intact in the rest of the text?

For example:

the word <cite>printing</cite> is in http://www.thisis<cite>printing</cite>.com

should become:

the word <cite>printing</cite> is in http://www.thisisprinting.com
  • Quite a difficult task. Have you tried anything yet? Commented Oct 24, 2013 at 21:49
  • Matching URLs isn't easy; human error, such as omitting the space after a sentence-ending period, can play havoc. Which parts of the URL can you guarantee will be there (for example a leading https?:// or www.)? If you can guarantee that some string will exist, then removing the tags wouldn't be hard. Commented Oct 25, 2013 at 0:20
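
To make the point about a guaranteed prefix concrete, here is a minimal sketch of URL detection anchored on such a prefix. The helper name findUrlCandidates is hypothetical, and the pattern assumes every URL starts with http://, https:// or www. and runs until the next whitespace; neither assumption comes from the question itself.

```php
<?php
// Hypothetical helper: collect URL-like runs that start with a known prefix.
function findUrlCandidates(string $text): array
{
    // Assumption: a URL begins with "http://", "https://" or "www."
    // and continues until the next whitespace character.
    preg_match_all('~(?:https?://|www\.)\S+~i', $text, $matches);

    // $matches[0] holds the full text of every match, in order.
    return $matches[0];
}

$text = 'the word <cite>printing</cite> is in http://www.thisis<cite>printing</cite>.com';
print_r(findUrlCandidates($text));
```

Because the match only stops at whitespace, a sentence such as "see www.example.com." would capture the trailing period as part of the URL, which is exactly the kind of human error the comment warns about.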

3 Answers

1

This is what I would do:

<?php
//a callback function wrapper for strip_tags
function strip($matches){
    return strip_tags($matches[0]);
}

//the string
$str = "the word <cite>printing</cite> is in http://www.thisis<cite>printing</cite>.com";
//match a url and call the strip callback on it
$str = preg_replace_callback("/:\/\/[^\s]*/", 'strip', $str);

//prove that it works
var_dump(htmlentities($str));

http://codepad.viper-7.com/XiPcs9

1

A suitable regex for this substitution could be:

#(https?://)(.*?)<cite>(.*?)</cite>([^\s]*)#s
  1. The s flag makes . match newlines as well.

  2. The lazy quantifiers (.*?) keep each match as short as possible, so it stops at the nearest tags instead of swallowing later, similar ones.

Snippet:

<?php
$str = "the word <cite>printing</cite> is in http://www.thisis<cite>printing</cite>.com";
// Re-join the captured pieces, leaving out the <cite> and </cite> that sit inside the URL
$replaced = preg_replace('#(https?://)(.*?)<cite>(.*?)</cite>([^\s]*)#s', '$1$2$3$4', $str);
echo $replaced;

// Output: the word <cite>printing</cite> is in http://www.thisisprinting.com

2 Comments

Instead of using (.*?) I would suggest using ([^\s]*?). With the way you are doing it, if the url was the first part of the string without cite tags, subsequent <cite> tags would be removed from later text.
That's a great start, and @JonathanKuhn's addition is priceless. How would I catch a second (third, etc.) instance of the tagged string in the URL? E.g.: https://<cite>app</cite>leid.<cite>app</cite>le.com
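
One possible way to handle both comments is sketched here. It swaps in the whitespace-excluding lazy groups suggested above and simply repeats the substitution until nothing is left to replace, since each pass can only unwrap one tag per URL. The function name unwrapCiteInUrls is hypothetical, and the sketch assumes URLs start with http:// or https:// and contain no whitespace.

```php
<?php
// Hypothetical helper: unwrap <cite> tags inside URLs, one tag per URL per pass.
function unwrapCiteInUrls(string $text): string
{
    // \S*? cannot cross whitespace, so a match never leaves the URL it started in.
    $pattern = '#(https?://\S*?)<cite>(\S*?)</cite>#i';

    do {
        // The fifth argument receives the number of replacements made in this pass.
        $text = preg_replace($pattern, '$1$2', $text, -1, $count);
    } while ($count > 0);

    return $text;
}

echo unwrapCiteInUrls('see https://<cite>app</cite>leid.<cite>app</cite>le.com'), "\n";
echo unwrapCiteInUrls('http://example.com and <cite>later</cite> text'), "\n";
```

The first call should print the URL with both tags unwrapped, while the second string is left unchanged because the tag sits outside the URL.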
0

Assuming you can already isolate the URLs in your text, you can strip the tags from each URL string:

$str = 'http://www.thisis<cite>printing</cite>.com';
$str = preg_replace('~</?cite>~i', "", $str);
echo $str;

OUTPUT:

http://www.thisisprinting.com
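
If the URLs still have to be found inside the full text, the replacement above could be combined with a callback, roughly as sketched here. The helper name stripCiteFromUrls is hypothetical, and the URL pattern (http:// or https:// followed by non-whitespace) is an assumption about the input rather than a general URL matcher.

```php
<?php
// Hypothetical helper: find each URL, then remove only the cite tags inside it.
function stripCiteFromUrls(string $text): string
{
    return preg_replace_callback(
        // Assumption: a URL starts with http:// or https:// and ends at whitespace.
        '~https?://\S+~i',
        function (array $match): string {
            // Drop <cite> and </cite> only; any other markup in the match is kept.
            return preg_replace('~</?cite>~i', '', $match[0]);
        },
        $text
    );
}

$text = 'the word <cite>printing</cite> is in http://www.thisis<cite>printing</cite>.com';
echo stripCiteFromUrls($text), "\n";
```

Because every tag inside a matched URL is removed in a single pass, this variant also covers URLs that contain several tagged fragments. It would, however, also remove a cite tag that directly follows a URL with no whitespace in between.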

2 Comments

@HamZa: Assuming $str = 'http://www.thisis<cite>printing</cite>.com'; not the full HTML text.
You omitted printing here, but he wants the output to be http://www.thisisprinting.com
