Disclaimer: Don't use regex!
Using regular expressions to parse HTML (or any other non-regular language) is not recommended: there are many pitfalls and many ways for such a solution to fail. That said, I do thoroughly enjoy using regular expressions to solve complex problems such as this one, which involves nested structures. If someone else provides a working non-regex solution, I recommend that you use it instead of the following.
A regex solution:
The following solution uses a recursive regular expression together with the preg_replace_callback() function. The regular expression matches an outermost DIV element (which may contain nested DIV elements), and the callback function calls itself recursively whenever the contents of a DIV element contain a nested DIV element. The callback strips the start and end tags of only those DIV elements whose class attribute includes the value removethis; DIV tags without the removethis class are preserved. (The removethis value is stored in a variable at the top of the following working script, so it can easily be changed to suit.) I think you will find that this does a pretty good job:
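To make the recursion easier to follow before reading the full script, here is a minimal sketch of the same idea on a deliberately simplified pattern. It assumes DIV tags without attributes, and the names DEMO_RE and demoBracketDivs are hypothetical and not part of the script below. The pattern uses (?R) to match a balanced outermost DIV, and the callback calls itself on the captured contents:

```php
<?php
// Simplified pattern for illustration only: <div> tags without attributes.
// (?R) recurses into the whole pattern, so nested pairs stay balanced.
const DEMO_RE = '%<div>((?:(?R)|[^<]+|<(?!/?div\b))*)</div>%i';

// Hypothetical helper: replaces each DIV's tags with square brackets.
function demoBracketDivs(string $text): string {
    return preg_replace_callback(
        DEMO_RE,
        function (array $m): string {
            // $m[1] holds the contents of the outermost match only;
            // nested DIVs inside it are handled by recursing on $m[1].
            return '[' . demoBracketDivs($m[1]) . ']';
        },
        $text
    );
}

echo demoBracketDivs('a<div>1<div>2</div>3</div>b'), "\n"; // a[1[2]3]b
```

The full script follows the same shape, but its pattern also parses tag attributes and its callback decides per element whether to keep or strip the tags.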
function stripSpecialDivTags($text)
<?php // test.php Rev:20111219_1600
// Remove DIV start and end tags having this class attribute:
$class_to_remove = "removethis";
// Recursive regex matches an outermost DIV element and its contents.
$re = '% # Match outermost DIV element.
  <                        # Start of HTML start tag
  (                        # $1: DIV element start tag.
    div                    # Tag name = DIV
    (                      # $2: DIV start tag attributes.
      (?:                  # Group for zero or more attributes.
        \s+                # Required whitespace precedes attrib.
        [\w.\-:]+          # Attribute name.
        (?:                # Group for optional attribute value.
          \s*=\s*          # Name and value separated by =
          (?:              # Group for value alternatives.
            \'[^\']*\'     # Either single quoted,
          | "[^"]*"        # or double quoted,
          | [\w.\-:]+      # or unquoted value.
          )                # End group of value alternatives.
        )?                 # Attribute value is optional.
      )*                   # Zero or more attributes.
    )                      # End $2: DIV start tag attributes.
    \s*                    # Optional whitespace before closing >.
    >                      # End DIV element start tag.
  )                        # End $1: DIV element start tag.
  (                        # $3: DIV element contents.
    (?:                    # Group for zero or more content alts.
      (?R)                 # Either a nested DIV element.
    |                      # or non-DIV tag stuff.
      [^<]*                # {normal*} Non-< start of tag stuff.
      (?:                  # Begin "unrolling-the-loop".
        <                  # {special} A "<", but only if it is
        (?!/?div\b)        # NOT start of a <div or </div
        [^<]*              # more {normal*} Non-< start of tag.
      )*                   # End {(special normal*)*} construct.
    )*                     # Zero or more content alternatives.
  )                        # End $3: DIV element contents.
  </div\s*>                # DIV element end tag.
  %xi';
// Remove matching start and end tags of DIV elements having specific class.
function stripSpecialDivTags($text) {
    global $re;
    $text = preg_replace_callback($re,
                '_stripSpecialDivTags_cb', $text);
    $text = str_replace("<\0", '<', $text);
    return $text;
}
function _stripSpecialDivTags_cb($matches) {
    global $re, $class_to_remove;
    if (preg_match($re, $matches[3])) {
        $matches[3] = preg_replace_callback($re,
                    '_stripSpecialDivTags_cb', $matches[3]);
    }
    // Regex to match class attribute and capture value in $1.
    $re_class = '/ ^           # Anchor to start of attributes string.
        (?:                    # Zero or more non-class attributes.
          \s+                  # Required whitespace precedes attrib.
          (?!class\b)          # Match any attribute other than "CLASS".
          [\w.\-:]+            # Attribute name.
          (?:                  # Group for optional attribute value.
            \s*=\s*            # Name and value separated by =.
            (?:                # Group for value alternatives.
              \'[^\']*\'       # Either single quoted,
            | "[^"]*"          # or double quoted,
            | [\w.\-:]+        # or unquoted value.
            )                  # End group of value alternatives.
          )?                   # Attribute value is optional.
        )*                     # Zero or more non-class attributes.
        \s+                    # Required whitespace precedes attrib.
        class\s*=\s*           # "CLASS" is the attribute we need.
        (?|                    # Use branch reset to capture value in $1.
          \'([^\']*)\'         # Either $1.1: a single quoted,
        | "([^"]*)"            # or $1.2: a double quoted,
        | ([\w.\-:]+)          # or $1.3: an un-quoted value.
        )                      # End branch reset to capture value in $1.
        /ix';
    $re_remove = '%(?<=^|\s)'.preg_quote($class_to_remove, '%').'(?=\s|$)%';
    if (preg_match($re_class, $matches[2], $m)) { // If DIV has a CLASS,
        if (preg_match($re_remove, $m[1])) {      // AND it has special value,
            return $matches[3];                   // Then strip start and end DIV tags.
        }
    }
    // Hide the start and end tags by inserting a temporary null char.
    return "<\0". $matches[1] . $matches[3] . "<\0/div>";
}
$data = file_get_contents('testdata.html');
$output = stripSpecialDivTags($data);
file_put_contents('testdata_out.html', $output);
?>
Example Input:
<div class="do not remove">
<div class=removethis>
<div>
<div class='do removethis one too'>
<div class="dontremovethisone">
</div>
</div>
</div>
</div>
</div>
Example Output:
<div class="do not remove">
<div>
<div class="dontremovethisone">
</div>
</div>
</div>
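Note that the DIV with class dontremovethisone survives in the output above: the class value is treated as a whitespace-separated list of names, not as a substring search. The following minimal sketch isolates that check, reusing the $re_remove construction from the script. The helper name hasClassToken is hypothetical and not part of the script:

```php
<?php
// Hypothetical helper (not in the original script): true when $class occurs
// in $value as a complete, whitespace-delimited class name.
function hasClassToken(string $value, string $class): bool {
    // Same construction as $re_remove in the script above.
    $re_remove = '%(?<=^|\s)' . preg_quote($class, '%') . '(?=\s|$)%';
    return preg_match($re_remove, $value) === 1;
}

foreach (['removethis', 'do removethis one too', 'dontremovethisone'] as $value) {
    echo $value, ' => ', hasClassToken($value, 'removethis') ? 'strip' : 'keep', "\n";
}
```

The first two values are reported as strip and the third as keep, which matches what happens to the corresponding DIV elements in the example.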
The regex is this complex because it must correctly handle DIV tag attributes whose values may contain angle brackets (< or >).
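As a rough illustration of that point, the sketch below compares a naive start-tag pattern with a condensed, quote-aware one modeled on the start-tag part of $re. The constant names and the firstMatch helper are hypothetical and only serve this demonstration:

```php
<?php
// Naive start-tag pattern: stops at the first ">" it sees.
const NAIVE_START_RE = '%<div[^>]*>%i';
// Quote-aware pattern: a condensed form of the start-tag part of $re.
const AWARE_START_RE =
    '%<div(?:\s+[\w.\-:]+(?:\s*=\s*(?:\'[^\']*\'|"[^"]*"|[\w.\-:]+))?)*\s*>%i';

// Hypothetical helper: returns the first match of $re in $html, or null.
function firstMatch(string $re, string $html): ?string {
    return preg_match($re, $html, $m) === 1 ? $m[0] : null;
}

$tag = '<div title="a > b" class="removethis">text</div>';
echo firstMatch(NAIVE_START_RE, $tag), "\n"; // <div title="a >
echo firstMatch(AWARE_START_RE, $tag), "\n"; // <div title="a > b" class="removethis">
```

The naive pattern cuts the tag off inside the title value and never sees the class attribute, whereas the quote-aware pattern consumes each quoted value as a unit.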