I want a regex in PHP that strips all attributes except: 'href', 'target', 'style', 'color', 'src', 'alt', 'border', 'cellpadding', 'cellspacing', 'width', 'height', 'title'

So that these are valid attributes:

<a href=i.php>
<a href = "i.php">
<img alt= " " src ="img.png">
<p title='Desc' style=color:FFFFFF;>

but these aren't valid attributes:

<a onclick="alert('Hello');">
<div id="whatever">
<div id = "whatever">
<div id = whatever> ..etc

I tried this, but it didn't work well:

$cont = $_POST['mycontent'];
$keep = array('href', 'target', 'style', 'color', 'src', 'alt', 'border', 'cellpadding', 'cellspacing', 'width', 'height', 'title');

// Get an array of all the attributes and their values in the data string
preg_match_all('/[a-z]+\s*=/iU', $cont, $attributes);

// Loop through the attribute pairs, match them against the keep array and remove
// them from $cont if they don't exist in the array
foreach ($attributes[0] as $attribute) {
    $attributeName = stristr(trim($attribute), '=', true);
    if (!in_array($attributeName, $keep)) {
        $cont = str_replace(' ' . $attribute, '', $cont);
    }
}

Help?

5 Comments
  • Did you consider using DOM for this task? It seems DOM::removeAttribute() is the safest. Commented Jul 7, 2015 at 10:02
  • @stribizhev but I want to remove the attributes on the server side, before inserting the post into the database Commented Jul 7, 2015 at 10:07
  • Look into HTMLPurifier, which already covers that. If it's meant as security feature then you'd end up with similar complex regexps anyway. Commented Jul 7, 2015 at 10:10
  • @regexps but I want to remove all attributes except some! Commented Jul 7, 2015 at 10:11
  • @mario and I don't want to remove HTML tags, just attributes Commented Jul 7, 2015 at 10:12
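
To illustrate the DOM suggestion from the comments: the DOM extension runs on the server too, so it can be applied before the post is stored. Below is a minimal sketch, not a tested solution. The function name stripAttributesDom is hypothetical, and the sketch assumes the input is an HTML fragment with ASCII content (character-encoding handling is left out). It only drops attribute names that are not in the list; it does not check the values of kept attributes, so it is not a replacement for a sanitizer such as HTMLPurifier.

```php
<?php
// Hypothetical helper: removes every attribute whose name is not in $keep.
function stripAttributesDom(string $html, array $keep): string
{
    $doc = new DOMDocument();

    // Collect parser warnings about loose markup instead of printing them.
    $previous = libxml_use_internal_errors(true);
    // Wrap the fragment so its nodes can be found again under <body>.
    $doc->loadHTML('<html><body>' . $html . '</body></html>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $body  = $doc->getElementsByTagName('body')->item(0);
    $xpath = new DOMXPath($doc);

    foreach ($xpath->query('.//*', $body) as $element) {
        // Copy the names first: the attribute map is live, so removing
        // entries while iterating over it could skip some of them.
        $names = [];
        foreach ($element->attributes as $attribute) {
            $names[] = $attribute->nodeName;
        }
        foreach ($names as $name) {
            if (!in_array(strtolower($name), $keep, true)) {
                $element->removeAttribute($name);
            }
        }
    }

    // Serialize only the children of <body>, i.e. the original fragment.
    $out = '';
    foreach ($body->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

$keep = ['href', 'target', 'style', 'color', 'src', 'alt', 'border',
         'cellpadding', 'cellspacing', 'width', 'height', 'title'];

echo stripAttributesDom(
    '<a href="i.php" onclick="alert(1)">x</a><div id="whatever">y</div>',
    $keep
);
```

Because a parser does the tokenizing, unquoted values and spaces around "=" need no special handling here. The serialized output may normalize quoting compared with the input.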

1 Answer

You're almost done. Let me suggest some changes (I haven't tested them yet):

Change your regex to

// Get an array of all the attributes and their values in the data string
preg_match_all('/([a-z]+\s*)=(\"|\')[a-zA-Z0-9|:|;]*(\"|\')/iU', $cont, $attributes);

and then

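// PHP variables take no type declaration, so the loop counter is just $i
for ($i = 0; $i < count($attributes[1]); $i++) {
    // Group 1 can include trailing whitespace (e.g. "href "), so trim it
    $attribute = trim($attributes[1][$i]);
    if (!in_array($attribute, $keep)) {
        // Remove the whole matched name="value" pair, including the leading space
        $cont = str_replace(' ' . $attributes[0][$i], '', $cont);
    }
}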

I believe this will help.

3 Comments

But I want the regex to match when there is a space between the "=" and the attribute value, for example: <p id= "description">
And I want it to strip attributes when there are no quotation marks as well, for example: <p id=descriptions>
What about modifying the regex that @josedefreitasc gave you? /([a-z]+\s*)\s*=\s*(\"|\')*[\/\-_a-zA-Z0-9|:|;]*(\"|\')*/iU
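
For the cases raised in these comments (spaces around "=", unquoted values), one possible direction is to match each opening tag first and then filter its attributes with preg_replace_callback. This is only a sketch under assumptions: the function name stripAttributesRegex and both patterns are made up for illustration and are not the answer's code. It assumes reasonably well-formed tags, and it leaves the values of kept attributes untouched (so something like href="javascript:..." would still pass through).

```php
<?php
// Hypothetical helper: keeps only the attributes named in $keep on opening tags.
function stripAttributesRegex(string $html, array $keep): string
{
    // Assumed building blocks: a tag/attribute name, and a value that is
    // double-quoted, single-quoted, or bare (unquoted).
    $name  = '[a-z][a-z0-9_:-]*';
    $value = '(?:"[^"]*"|\'[^\']*\'|[^\s"\'>]+)';

    // Group 1: tag name, group 2: all attributes, group 3: optional " /".
    $tagPattern  = '/<(' . $name . ')((?:\s+' . $name . '(?:\s*=\s*' . $value . ')?)*)(\s*\/?)>/i';
    // Group 1: attribute name; the full match includes the leading whitespace.
    $attrPattern = '/\s+(' . $name . ')(?:\s*=\s*' . $value . ')?/i';

    $result = preg_replace_callback(
        $tagPattern,
        function (array $tag) use ($attrPattern, $keep) {
            // Drop each attribute whose (lowercased) name is not whitelisted.
            $kept = preg_replace_callback(
                $attrPattern,
                function (array $attr) use ($keep) {
                    return in_array(strtolower($attr[1]), $keep, true) ? $attr[0] : '';
                },
                $tag[2]
            );
            return '<' . $tag[1] . $kept . $tag[3] . '>';
        },
        $html
    );

    if ($result === null) {
        // Fail loudly rather than return unfiltered markup.
        throw new RuntimeException('PCRE error code ' . preg_last_error());
    }
    return $result;
}

$keep = ['href', 'target', 'style', 'color', 'src', 'alt', 'border',
         'cellpadding', 'cellspacing', 'width', 'height', 'title'];

echo stripAttributesRegex('<p id= "description" title=\'Desc\'>text</p>', $keep);
```

Matching the tag as a whole avoids the str_replace step, which can also remove text outside tags that happens to look like an attribute. Markup the tag pattern does not recognize is left unchanged, which is one reason the DOM or HTMLPurifier routes mentioned in the question's comments are usually safer for untrusted input.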
