Removing specific html tags using preg_replace without removing content

Question

I'm trying to clean up excess html based on css classes. I don't want to remove all tags of a certain type, just specific tags, and I want to keep the content within them intact. I'm trying variations along the lines of this:

$content = preg_replace(
    '#(<div class\=\"removethis\">(^.*)</div>)#is', 
    '', 
    $content
);

I realise the above code can't work, but hopefully it'll help explain what I'm trying to do. I'm not that familiar with regular expressions, so I haven't found anything that works so far.

strip_tags would remove all elements of a certain type, not all elements with a specific css class. (user1106248, commented Dec 19, 2011 at 16:39)

mario · Accepted Answer · 2011-12-19 16:21:38Z

3

The ^ is likely wrong there. That looks for the start of the subject, not even for the start of a line; and that won't occur at this position.

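And you are replacing it with '' (nothing) instead of the captured contents. Note that in your pattern the outer parentheses form group '$1' (the whole element), so the contents are in the inner group, '$2'.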
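
Putting those two fixes together, a minimal sketch might look like the following. It assumes the wrapper divs are not nested and that the class attribute is written exactly as shown. The sample markup is made up for illustration, and the pattern keeps a single capture group so the contents land in $1.

```php
<?php
// Illustrative input; the markup and variable names are assumptions.
$content = '<p>Intro</p><div class="removethis">Keep <b>me</b></div>'
         . '<div class="other">Untouched</div>';

// No ^ anchor, a lazy .*? for the contents, and '$1' as the replacement
// so the captured contents are written back in place of the whole element.
$cleaned = preg_replace(
    '#<div class="removethis">(.*?)</div>#is',
    '$1',
    $content
);

echo $cleaned, "\n";
// <p>Intro</p>Keep <b>me</b><div class="other">Untouched</div>
```

With nested div elements the lazy match stops at the first </div> it meets, so this simple form is only suitable for flat markup.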

And the off-topic answer for repwhoring: You could alternatively use querypath or another library for managing html content. Then the replacement gets simpler:

  htmlqp($html)->remove("div.removethis")->...()->writeHTML();

Often inappropriate for output transformation. But easier and more useful in other cases.
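
If pulling in a library is not an option, a similar parser-based approach can be sketched with PHP's bundled DOM extension. This is only a rough sketch under some assumptions: the function name unwrapDivsByClass is hypothetical, the input is an ASCII HTML fragment, and $class is a plain class token with no quotes in it.

```php
<?php
// Hypothetical helper: removes the start and end tags of every <div>
// carrying the given class, keeping the children in place.
function unwrapDivsByClass(string $html, string $class): string
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // tolerate imperfect markup
    $doc->loadHTML('<html><body>' . $html . '</body></html>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Match the class as a whole token in a space-separated class list.
    $xpath = new DOMXPath($doc);
    $query = sprintf(
        '//div[contains(concat(" ", normalize-space(@class), " "), " %s ")]',
        $class
    );

    // Copy the node list first, since the tree is modified while looping.
    foreach (iterator_to_array($xpath->query($query)) as $div) {
        while ($div->firstChild) {
            // Move each child in front of the div, preserving order.
            $div->parentNode->insertBefore($div->firstChild, $div);
        }
        $div->parentNode->removeChild($div); // drop the now-empty wrapper
    }

    // Serialize only the children of <body>, i.e. the original fragment.
    $out = '';
    foreach ($doc->getElementsByTagName('body')->item(0)->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

echo unwrapDivsByClass(
    '<div class="a removethis">Hi <b>x</b></div><div class="keep">K</div>',
    'removethis'
), "\n";
```

The parser may normalize the markup it writes back (attribute quoting, entities), so the output is not guaranteed to be byte-identical to the untouched parts of the input.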


ridgerunner · Accepted Answer · 2011-12-19 23:59:04Z

Disclaimer: Don't use regex!

It is not recommended to use regular expressions to parse HTML (or any other non-regular language). There are many pitfalls and ways for the solution to fail. That said, I do thoroughly enjoy using regular expressions to solve complex problems such as this one which involves nested structures. If someone else provides a working non-regex solution, I would recommend that you use that one instead of the following.

A regex solution:

The following solution implements a recursive regular expression used in conjunction with the preg_replace_callback() function (the callback calls itself recursively when the contents of a DIV element contain a nested DIV element). The regular expression matches the outermost DIV element, which may contain nested DIV elements. The callback function strips the start and end tags of only those DIV elements having a class attribute that includes: removethis. The DIV tags that do not have the removethis class are preserved. (The removethis value is stored in a variable at the top of the following working script, where it can easily be changed to suit.) I think you will find that this does a pretty good job:

function stripSpecialDivTags($text)

<?php // test.php Rev:20111219_1600
// Remove DIV start and end tags having this class attribute:
$class_to_remove = "removethis";
// Recursive regex matches an outermost DIV element and its contents.
$re = '% # Match outermost DIV element.
    <                     # Start of HTML start tag
    (                     # $1: DIV element start tag.
      div                 # Tag name = DIV
      (                   # $2: DIV start tag attributes.
        (?:               # Group for zero or more attributes.
          \s+             # Required whitespace precedes attrib.
          [\w.\-:]+       # Attribute name.
          (?:             # Group for optional attribute value.
            \s*=\s*       # Name and value separated by =
            (?:           # Group for value alternatives.
              \'[^\']*\'  # Either single quoted,
            | "[^"]*"     # or double quoted,
            | [\w.\-:]+   # or unquoted value.
            )             # End group of value alternatives.
          )?              # Attribute value is optional.
        )*                # Zero or more attributes.
      )                   # End $2: DIV start tag attributes.
      \s*                 # Optional whitespace before closing >.
      >                   # End DIV element start tag.
    )                     # End $1: DIV element start tag.
    (                     # $3: DIV element contents.
      (?:                 # Group for zero or more content alts.
        (?R)              # Either a nested DIV element.
      |                   # or non-DIV tag stuff.
        [^<]*             # {normal*} Non-< start of tag stuff.
        (?:               # Begin "unrolling-the-loop".
          <               # {special} A "<", but only if it is
          (?!/?div)       # NOT start of a <div or </div
          [^<]*           # more {normal*} Non-< start of tag.
        )*                # End {(special normal*)*} construct.
      )*                  # Zero or more content alternatives.
    )                     # End $3: DIV element contents.
    </div\s*>             # DIV element end tag.
    %xi';

// Remove matching start and end tags of DIV elements having specific class.
function stripSpecialDivTags($text) {
    global $re;
    $text = preg_replace_callback($re,
            '_stripSpecialDivTags_cb', $text);
    $text = str_replace("<\0", '<', $text);
    return $text;
}
function _stripSpecialDivTags_cb($matches) {
    global $re, $class_to_remove;
    if (preg_match($re, $matches[3])) {
        $matches[3] = preg_replace_callback($re,
            '_stripSpecialDivTags_cb', $matches[3]);
    }
    // Regex to match class attribute and capture value in $1.
    $re_class = '/ ^      # Anchor to start of attributes string.
        (?:               # Zero or more non-class attributes.
          \s+             # Required whitespace precedes attrib.
          (?!class\b)     # Match any attribute other than "CLASS".
          [\w.\-:]+       # Attribute name.
          (?:             # Group for optional attribute value.
            \s*=\s*       # Name and value separated by =.
            (?:           # Group for value alternatives.
              \'[^\']*\'  # Either single quoted,
            | "[^"]*"     # or double quoted,
            | [\w.\-:]+   # or unquoted value.
            )             # End group of value alternatives.
          )?              # Attribute value is optional.
        )*                # Zero or more non-class attributes.
        \s+               # Required whitespace precedes attrib.
        class\s*=\s*      # "CLASS" is the attribute we need.
        (?|               # Use branch reset to capture value in $1.
          \'([^\']*)\'    # Either $1.1: a single quoted,
        | "([^"]*)"       # or $1.2: a double quoted,
        | ([\w.\-:]+)     # or $1.3: an un-quoted value.
        )                 # End branch reset to capture value in $1.
        /ix';
    $re_remove = '%(?<=^|\s)'.preg_quote($class_to_remove, '%').'(?=\s|$)%';
    if (preg_match($re_class, $matches[2], $m)) {// If DIV has a CLASS,
        if (preg_match($re_remove, $m[1])) { // AND it has special value,
            return $matches[3];     // Then strip start and end DIV tags.
        }
    }
    // Hide the start and end tags by inserting a temporary null char.
    return "<\0". $matches[1] . $matches[3] . "<\0/div>";
}
$data = file_get_contents('testdata.html');
$output = stripSpecialDivTags($data);
file_put_contents('testdata_out.html', $output);
?>

Example Input:

<div class="do not remove">
    <div class=removethis>
        <div>
            <div class='do removethis one too'>
                <div class="dontremovethisone">
                </div>
            </div>
        </div>
    </div>
</div>

Example Output:

<div class="do not remove">

        <div>

                <div class="dontremovethisone">
                </div>

        </div>

</div>

The complexity of the regex is required to properly handle tag attributes having values that may contain <> angle brackets.
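
The class check in the callback treats the attribute value as a whitespace-separated list of tokens, which is why class='do removethis one too' is stripped while class="dontremovethisone" is not. A small sketch that exercises the same $re_remove pattern on its own, using values from the example input (the $samples and $results names are only for illustration):

```php
<?php
$class_to_remove = 'removethis';

// Same construction as in the script: the class name must be preceded by
// start-of-string or whitespace and followed by whitespace or end-of-string.
$re_remove = '%(?<=^|\s)' . preg_quote($class_to_remove, '%') . '(?=\s|$)%';

$samples = [
    'removethis',
    'do removethis one too',
    'dontremovethisone',
    'do not remove',
];

$results = [];
foreach ($samples as $value) {
    // true means the DIV tags would be stripped, false means they are kept.
    $results[$value] = preg_match($re_remove, $value) === 1;
    echo $value, ' => ', $results[$value] ? 'strip tags' : 'keep tags', "\n";
}
```

Only the first two values count as matches, because in the other two the string removethis either sits inside a longer token or does not appear as a token at all.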


maček · Accepted Answer · 2011-12-19 16:24:23Z

0

Do not parse HTML with regex. You should be using strip_tags

$html = '<div class="foo">Hello world. <b>I am bold!</b></div>';

$allowed_tags = "<b>";

$text = strip_tags($html, $allowed_tags);

echo $text; #=> Hello world. <b>I am bold!</b>

1 Comment

user1106248 Over a year ago

Thanks, but I don't think you've read the question... strip_tags would remove all elements of a certain type, not all elements with a specific css class.
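
The point of this comment can be illustrated with a short sketch (the sample markup is an assumption): strip_tags decides by tag name only, so both wrappers below are removed regardless of their class.

```php
<?php
// Two divs that differ only in their class attribute.
$html = '<div class="removethis">A <b>bold</b></div><div class="keep">B</div>';

// Only <b> is allowed, so every <div> tag is dropped, whatever its class.
$text = strip_tags($html, '<b>');

echo $text, "\n"; // A <b>bold</b>B
```

There is no way to tell strip_tags to keep the second div while unwrapping the first, which is why it does not fit the question.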
