I need to find http links and replace them with hyperlinks. These http links are inside span tags.

$text holds an HTML page. One of the span tags contains something like:

<span class="styleonetwo" >http://www.cnn.com/live-event</span>

Here is my code:

$doc = new DOMDocument();
$doc->loadHTML($text);
foreach($doc->getElementsByTagName('span') as $anchor) {
    $link = $anchor->nodeValue;
    if(substr($link, 0, 4) == "http")
    {
        $link = "<a href=\"$link\">$link</a>";
    }
    if(substr($link, 0, 3) == "www")
    {
        $link = "<a href=\"http://$link\">$link</a>";
    }    
    $anchor->nodeValue = $link;
}
echo $doc->saveHTML();

It works OK. However, I want this to work even if the data inside the span is something like:

<span class="styleonetwo" > sometexthere http://www.cnn.com/live-event somemoretexthere</span>

Obviously the above code won't work for this situation. Is there a way to search for and replace a pattern using DOMDocument without using preg_replace?

Update: To answer phil's question regarding preg_replace:

I used regexpal.com to test the following pattern matching:

\b(?:(?:https?|ftp|file)://|(www|ftp)\.)[-A-Z0-9+&@#/%?=~_|$!:,.;]*[-A-Z0-9+&@#/%=~_|$]

It works great in the tester on regexpal. When I use the same pattern in PHP code, I get lots of strange errors, including an "unknown modifier" error even for an escape character! Here is my code for preg_replace:

$httpRegex = '/\b(\?:(\?:https?|ftp|file):\/\/|(www|ftp)\.)[-A-Z0-9+&@#/%\?=~_|$!:,.;]*[-A-Z0-9+&@#/%=~_|$]/';
$cleanText = preg_replace($httpRegex, "<a href='$0'>$0</a>", $text);

I was so frustrated with the "unknown modifier" errors that I turned to DOMDocument to solve my problem.

2 Comments
  • What's the problem with preg_replace()? Commented Oct 18, 2012 at 1:05
  • Your regex is not escaped. You have to escape the escape character and the delimiter! Commented Oct 18, 2012 at 2:01

1 Answer

Regular expressions suit this problem well, so it is better to use preg_replace.

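Your pattern contains several unescaped delimiters: the first bare / inside a character class ends the pattern, and PHP then reads the next character as an "unknown modifier". Either escape them or choose another character as the delimiter, for instance ^. In addition, the question marks in (\?: must not be escaped, otherwise the group expects a literal "?:" instead of being non-capturing, and the i modifier is needed because the character classes list only A-Z. Thus, the correct pattern would be: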

$httpRegex = '^\b(?:(?:https?|ftp|file):\/\/|(www|ftp)\.)[-A-Z0-9+&@#\/%\?=~_|$!:,.;]*[-A-Z0-9+&@#\/%=~_|$]^i';
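
As a quick check, here is a minimal sketch that runs the pattern above through preg_replace. The sample $text is only a hypothetical stand-in for your page, taken from the span in the question.

```php
<?php
// Pattern from the answer: '^' is the delimiter, 'i' makes A-Z match lowercase too.
$httpRegex = '^\b(?:(?:https?|ftp|file):\/\/|(www|ftp)\.)[-A-Z0-9+&@#\/%\?=~_|$!:,.;]*[-A-Z0-9+&@#\/%=~_|$]^i';

// Hypothetical sample input (assumption: mirrors the span from the question).
$text = '<span class="styleonetwo" > sometexthere http://www.cnn.com/live-event somemoretexthere</span>';

// $0 is the whole match; single quotes keep PHP from interpolating it.
$cleanText = preg_replace($httpRegex, '<a href="$0">$0</a>', $text);

echo $cleanText, "\n";
```

Keep in mind that running the pattern over a whole HTML string also matches URLs that sit inside attributes (an existing href, for example), so this is safest on text you know contains only plain URLs.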

2 Comments

Thanks Nikita! It helps. When I do preg_replace($httpRegex, "<a href='$0'>$0</a>", $text); it gives me a link without "http". I could replace the code with preg_replace($httpRegex, "http://$0", $text); but it would give me "http:/" if the link in the text is "something". I could have links like <span>wwww.link.com</span> or <span>link.com</span>. Do I need to write two regexes to solve this? Thanks again.
I would use a preg_replace_callback function - here's an example: pastebin.com/GfPjtbku
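
A rough sketch of the preg_replace_callback idea follows. It is not the pastebin code. The linkify() helper and the sample string are hypothetical, and it assumes that a match without a scheme should simply get "http://" in its href.

```php
<?php
// Same pattern as in the answer ('^' delimiter, case-insensitive).
$httpRegex = '^\b(?:(?:https?|ftp|file):\/\/|(www|ftp)\.)[-A-Z0-9+&@#\/%\?=~_|$!:,.;]*[-A-Z0-9+&@#\/%=~_|$]^i';

// Hypothetical helper: wraps every match in an anchor, adding a scheme when it is missing.
function linkify(string $text, string $pattern): string
{
    return preg_replace_callback($pattern, function (array $m): string {
        $url = $m[0];
        // Matches that start with "www." or "ftp." have no scheme; assume http:// for the href.
        $hasScheme = preg_match('~^(?:https?|ftp|file)://~i', $url) === 1;
        $href = $hasScheme ? $url : 'http://' . $url;
        // The pattern cannot match quotes or angle brackets, so the match is used unescaped.
        return '<a href="' . $href . '">' . $url . '</a>';
    }, $text);
}

echo linkify('see www.cnn.com/live-event and http://example.com/a', $httpRegex), "\n";
```

One pattern is enough for the "www." and "http://" cases because the callback decides per match. A bare domain such as link.com is not matched by this pattern at all, so it would need a broader (and more error-prone) expression.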
