PHP Regex - Remove text between tags

Question

I have this:

$text = 'text text text s html tagove
<div id="content">ss adsda sdsa </div>
oshte text s html tagove';
$content = preg_replace('/(<div\sid=\"content\">)[^<]+(<\/div>)/i', '', $text);
var_dump($content);

But if the <div id="content"> contains other tags, such as <b> or <i>, the pattern no longer matches.

For example:

$text = 'text text text s html tagove
<div id="content"><b> stfu </b> ss adsda sdsa </div>
oshte text s html tagove';

Don't parse HTML with regex. Use one of the parsers you have in PHP.
– Qtax, Commented Mar 10, 2012 at 1:52
I would not use STFU as an illustration of your need. It's a bad word.
– Marcello Grechi Lins, Commented Mar 19, 2013 at 14:35
@MarcelloGrechiLins - I'm sure the Southern Tenant Farmers' Union might think differently! ;-)
– ghoti, Commented Mar 21, 2013 at 11:16

ghoti · Accepted Answer · 2012-03-10 06:19:58Z

5

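You can use a lazy quantifier instead. Where [^<]+ refuses to cross any tag, .+? accepts any character but matches as few as possible, so it stops at the first closing </div>.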

$s="foo<div>Some content is <b>bold</b>.</div>bar\n";

print preg_replace("/<div>.+?<\/div>/i", "", $s);

output:

foobar
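
To make the difference between greedy and lazy matching concrete, here is a minimal sketch. The input string with two divs on one line is an assumption made up for illustration; it is not from the question.

```php
<?php
// Two divs on one line expose the difference between greedy and lazy matching.
$s = "foo<div>one</div>KEEP<div>two</div>bar";

// Greedy: .+ runs to the LAST </div>, so "KEEP" is swallowed as well.
$greedy = preg_replace('/<div>.+<\/div>/i', '', $s);

// Lazy: .+? stops at the FIRST </div> after each opening tag.
$lazy = preg_replace('/<div>.+?<\/div>/i', '', $s);

echo $greedy, "\n"; // foobar
echo $lazy, "\n";   // fooKEEPbar
```

Neither variant understands nesting: a div inside the div ends the lazy match too early.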

UPDATE per comments:

[ghoti@pc ~]$ cat doit.php 
<?php

$text = 'text text text s html tagove
<div id="content"><b> stfu </b> ss adsda sdsa </div>
oshte text s html tagove';

print preg_replace('/<div id="content">.+?<\/div>/im', '', $text) .  "\n";

[ghoti@pc ~]$ php doit.php 
text text text s html tagove

oshte text s html tagove
[ghoti@pc ~]$
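
One caveat worth sketching: the m modifier in the pattern above only changes how ^ and $ behave. If the div's content itself spans several lines, the dot needs the s modifier to match newlines. The multi-line input below is an assumed example, not the OP's data.

```php
<?php
// Assumed input: the div's content is split across two lines.
$text = "before\n<div id=\"content\"><b>bold</b>\nsecond line</div>\nafter";

// Without /s the dot cannot cross the newline, so nothing matches
// and preg_replace returns the text unchanged.
$unchanged = preg_replace('/<div id="content">.+?<\/div>/i', '', $text);

// With /s the dot also matches newlines, so the whole div is removed.
$stripped = preg_replace('/<div id="content">.+?<\/div>/is', '', $text);

var_dump($unchanged === $text); // bool(true)
var_dump($stripped);
```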

edited Mar 10, 2012 at 6:19

answered Mar 9, 2012 at 22:04

ghoti


6 Comments

Jonathan Kuhn Over a year ago

This only matches the div tag if it has no attributes, unlike the one in the example.

Qtax Over a year ago

And it won't work on nested divs, e.g. <div id="content">ss <div>adsda</div> sdsa </div>. -1. Don't parse HTML with regex.

ghoti Over a year ago

@Qtax - There's nothing wrong with parsing HTML with regex if you've got predictable input and the problem is within the realm of what a regex can handle. The OP was worried about embedded <b>, not embedded <div>s.

ghoti Over a year ago

@JonathanKuhn - this example was intended as a simple demonstration of a lazy quantifier. But okay, I'll add a correction to the OP's original preg_replace as an update. <sigh>

Graham Over a year ago

I agree. This works, and it addresses the OP's concerns. If handling HTML in RE is a bad idea, perhaps it's a downvote for this question, but not for the answer.


anubhava · Accepted Answer · 2012-03-10 04:43:17Z

2

It is better to use DOM to parse HTML. Here is DOM-based code that removes your div tag:

$html = <<< EOF
text text text s html tagove
<div id="content">ss <div>abcd</div>adsda sdsa </div>
oshte text s html tagove
<div id="content">foo <div>bar</div>baz foo</div>
some more text here
EOF;

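$doc = new DOMDocument();
// Collect libxml warnings about the loose markup instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
// Select every div with id="content", however deeply its contents are nested.
$nlist = $xpath->query("//div[@id='content']");
// The XPath result is a static list, so removing nodes while iterating is safe.
foreach ($nlist as $node) {
    $node->parentNode->removeChild($node);
}
$newHTML = $doc->saveHTML();
echo $newHTML;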

Thanks to @Qtax for pointing out that the original question changed after I wrote my previous regex-based answer.

OUTPUT:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
<p>text text text s html tagove
</p>
oshte text s html tagove

some more text here</body></html>
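
The output above includes the doctype and the html/body wrapper that loadHTML() adds. If only the cleaned fragment is wanted, one possible approach is to serialize the children of body one by one. This is a sketch under assumptions: the shortened input and the $fragment variable are illustrative, and libxml may still wrap loose text in <p> as seen above.

```php
<?php
// Assumed, shortened input with a nested div inside the target div.
$html = 'text before
<div id="content">ss <div>abcd</div>adsda sdsa </div>
text after';

$doc = new DOMDocument();
libxml_use_internal_errors(true); // keep parser warnings quiet
$doc->loadHTML($html);
libxml_clear_errors();

// Remove every div with id="content", including anything nested in it.
$xpath = new DOMXPath($doc);
foreach ($xpath->query("//div[@id='content']") as $node) {
    $node->parentNode->removeChild($node);
}

// Serialize only the children of <body>, skipping doctype, <html> and <body>.
$body = $doc->getElementsByTagName('body')->item(0);
$fragment = '';
foreach ($body->childNodes as $child) {
    $fragment .= $doc->saveHTML($child);
}

echo $fragment;
```

Passing a node to saveHTML() requires PHP 5.3.6 or later.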

edited Mar 10, 2012 at 4:43

answered Mar 9, 2012 at 22:03

anubhava


4 Comments

anubhava Over a year ago

@Qtax: Glad that you at least left a comment with the downvote. If you can tell me a bit more about why it is worse, I would really appreciate it.

Qtax Over a year ago

The code in your answer doesn't work or even attempt to solve the issue in question, read the question again. (Hint: He's having problems with nested tags.)

anubhava Over a year ago

Ah crap, you're right. However, the nested tags weren't in the question when I posted this answer. I keep telling people on SO NOT to use regex for HTML parsing (you can see the warning at the top of my answer), and now it has come back to bite me :)

anubhava Over a year ago

@Qtax: I have edited the answer and posted DOM-based code to remove the div tag.
