Using regex to remove HTML tags

Question

I need to convert

$text = 'We had <i>fun</i>. Look at <a href="http://example.com">this photo</a> of Joe';

to

$text = 'We had fun. Look at this photo (http://example.com) of Joe';

All HTML tags are to be removed and the href value from <a> tags needs to be added like above.

[Edit] There could be multiple links in the text.

What would be an efficient way to solve this with regex? Any code snippet would be great.

You don't want to solve that with Regex. Use DOM if you care for your sanity.
– Gordon, Commented May 5, 2010 at 17:57
I dunno, Gordon. I extracted the url with a regex much easier than fiddling with the DOM.
– Timothy, Commented May 5, 2010 at 18:00
So your title wants regex but your body question doesn't. Which one is it?
– waiwai933, Commented May 5, 2010 at 18:11

nc3b · Accepted Answer · 2010-05-05 18:00:31Z

5

First do a preg_replace to keep the link. You could use:

$str = preg_replace('~<a href="(.*?)">(.*?)</a>~', '$2 ($1)', $str);

Then use strip_tags which will finish off the rest of the tags.
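
The two steps can be sketched together as follows. This is a minimal sketch, assuming every link looks exactly like the one in the question (a double-quoted href as the only attribute). The function name links_to_text is hypothetical and only there for illustration; ~ is used as the pattern delimiter so that the / in </a> needs no escaping.

```php
<?php
// Hypothetical helper: the name links_to_text() is illustrative, not from the question.
function links_to_text(string $str): string
{
    // Step 1: rewrite <a href="URL">label</a> as "label (URL)".
    // preg_replace replaces every match, so several links in one string are handled.
    $str = preg_replace('~<a href="(.*?)">(.*?)</a>~', '$2 ($1)', $str);

    // Step 2: strip whatever tags remain (<i>, <b>, ...).
    return strip_tags($str);
}

$text = 'We had <i>fun</i>. Look at <a href="http://example.com">this photo</a> of Joe';
echo links_to_text($text), "\n";
// prints: We had fun. Look at this photo (http://example.com) of Joe
```

Anything that deviates from that exact link shape (extra attributes, single quotes, uppercase tags) passes through step 1 untouched and loses its URL in step 2.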

3 Comments

Javier Parra Over a year ago

This won't work well; as has been thoroughly explained elsewhere, HTML is too complex to be parsed with regex. For instance, that simple pattern will break when the href attribute uses single quotes instead of double quotes (to fix that, replace the first double quote with ([\'\"]) and the second with a backreference).

nc3b Over a year ago

I agree. (X)HTML is complex and one should think twice before parsing it with a regular expression. That said, for a quick one-off, DOM might be overkill.

Gordon Over a year ago

@Lost_in_code This will fail if the user added any other attribute to the link, e.g. <a class="foo" href="... or title or rel or whatever else is possible there. It will also not work with <a href = "..." or uppercase, etc. Just believe us: Regex sucks for this :)
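
The specific failures named in these comments (single quotes, extra attributes, spaces around =, uppercase) can be covered by a somewhat more tolerant pattern. The sketch below is an illustration only, not a parser: unquoted attribute values, a > inside an attribute, nested links, or commented-out markup will still defeat it. The function name links_to_text_tolerant is hypothetical.

```php
<?php
// Hypothetical helper name; the pattern is an illustration, not a complete HTML parser.
function links_to_text_tolerant(string $str): string
{
    // <a\b          the tag name "a" and nothing longer (not <abbr>)
    // [^>]*?        any attributes that come before href
    // (["\'])...\1  the value in single or double quotes, closed by the same quote
    // i = case-insensitive, s = let . match newlines inside the link text
    $pattern = '~<a\b[^>]*?\bhref\s*=\s*(["\'])(.*?)\1[^>]*>(.*?)</a>~is';

    // $1 = the quote character, $2 = the URL, $3 = the link text.
    $str = preg_replace($pattern, '$3 ($2)', $str);

    return strip_tags($str);
}

echo links_to_text_tolerant("See <A class=\"foo\" HREF = 'http://example.com'>this <b>photo</b></A>"), "\n";
// prints: See this photo (http://example.com)
```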

Javier Parra · Accepted Answer · 2010-05-05 17:58:24Z

1

Try an XML parser to replace every tag with its inner HTML, and each a tag with its inner HTML plus its href attribute.

http://www.php.net/manual/en/book.domxml.php

Gordon · Accepted Answer · 2010-05-05 19:01:37Z

The DOM solution:

$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
foreach($xpath->query('//a[@href]') as $node) {
    $textNode = new DOMText(sprintf('%s (%s)',
        $node->nodeValue, $node->getAttribute('href')));
    $node->parentNode->replaceChild($textNode, $node);
}
echo strip_tags($dom->saveHTML());

and the same without XPath:

$dom = new DOMDocument;
$dom->loadHTML($html);
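// getElementsByTagName returns a live list; copy it first so that replacing nodes does not skip links
foreach(iterator_to_array($dom->getElementsByTagName('a')) as $node) {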
    if($node->hasAttribute('href')) {
        $textNode = new DOMText(sprintf('%s (%s)',
            $node->nodeValue, $node->getAttribute('href')));
        $node->parentNode->replaceChild($textNode, $node);
    }
}
echo strip_tags($dom->saveHTML());

All it does is load the HTML into a DOMDocument instance. The first version uses an XPath expression, which is kind of like SQL for XML, to get all links with an href attribute. It then creates a text node from each link's text content (nodeValue) and its href attribute and replaces the link with that node. The second version uses only the DOM API and no XPath.

Yes, it's a few lines more than Regex but this is clean and easy to understand and it won't give you any headaches when you need to add additional logic.
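
As an example of adding logic on top of this, here is a sketch that wraps the XPath version in a function and reads the text straight from the DOM instead of calling strip_tags on the serialized HTML. The function name html_links_to_text is hypothetical, and the sketch assumes a non-empty, plain ASCII HTML fragment as input.

```php
<?php
// Hypothetical wrapper around the DOM approach; the function name is illustrative.
function html_links_to_text(string $html): string
{
    $dom = new DOMDocument;
    // @ silences libxml warnings about imperfect markup.
    // Assumption: $html is a non-empty fragment in plain ASCII.
    @$dom->loadHTML($html);

    $xpath = new DOMXPath($dom);
    // The XPath result is a static list, so replacing nodes inside the loop is safe.
    foreach ($xpath->query('//a[@href]') as $node) {
        $label = sprintf('%s (%s)', $node->textContent, $node->getAttribute('href'));
        $node->parentNode->replaceChild(new DOMText($label), $node);
    }

    // loadHTML wraps a fragment in <html><body>; the body's textContent is the
    // text of everything inside it, without tags and with entities decoded.
    $body = $dom->getElementsByTagName('body')->item(0);
    return $body ? trim($body->textContent) : '';
}

echo html_links_to_text('We had <i>fun</i>. Look at <a href="http://example.com">this photo</a> of Joe'), "\n";
// prints: We had fun. Look at this photo (http://example.com) of Joe
```

Reading textContent rather than stripping tags from saveHTML() output means an & in a URL comes back as a plain & and not as &amp;.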

Frank V · Accepted Answer · 2010-05-05 18:21:59Z

0

I've done things like this using variations of substring and replace. ~~I'd probably use regex today~~ but you wanted an alternative so:

For the <i> tags, I'd do something like:

$text = str_replace("<i>", "", $text);
$text = str_replace("</i>", "", $text);

(My PHP is really rusty; str_replace takes the search string, then the replacement, then the subject. The idea is what I'm sharing.)

The <a> tag is a bit more tricky, but it can be done. You need to find the point where <a starts and where the > of the opening tag ends. Then you cut out that whole stretch and remove the closing </a>.

That might go something like:

$start = strpos($text, "<a");        // where the opening <a ...> tag starts
$end = strpos($text, ">", $start);   // the > that closes the opening tag
$text = substr($text, 0, $start) . substr($text, $end + 1);  // cut the opening tag out
$text = str_replace("</a>", "", $text);  // drop the closing tag

(I don't know if this will work; again, the idea is what I want to communicate. I hope the code fragments help but they probably don't work "out of the box". There are also a lot of possible bugs in the code snippets depending on your exact implementation and environment.)

Reference:

strpos - http://www.php.net/manual/en/function.strpos.php
str_replace - http://www.php.net/manual/en/function.str-replace.php
substr - http://php.net/manual/en/function.substr.php

edited May 5, 2010 at 18:21

answered May 5, 2010 at 18:15

1 Comment

Yeti Over a year ago

Thanks for this, but I've edited the question, since regex seems to be the way to go. It's much simpler and quicker.

Erik · Accepted Answer · 2010-05-05 19:29:30Z

0

It's also very easy to do with a parser:

# available from http://simplehtmldom.sourceforge.net
include('simple_html_dom.php');

# parse and echo
$html = str_get_html('We had <i>fun</i>. Look at <a href="http://example.com">this photo</a> of Joe');

# rewrite every link as "text (href)"
foreach ($html->find('a') as $a) {
    $a->outertext = "{$a->innertext} ({$a->href})";
}

echo strip_tags($html);

And that produces the output you want in your test case.
