Remove attributes from html tags using PHP while keeping specific attributes

Question

I found a way to remove all tag attributes from an HTML string using PHP:

$html_string = "<div class='myClass'><b>This</b> is an <span style='margin:20px'>example</span><img src='ima.jpg' /></div>";
$output = preg_replace("/<([a-z][a-z0-9]*)[^>]*?(\/?)>/i",'<$1$2>', $html_string);
echo $output;
//<div><b>This</b> is an <span>example</span><img/></div>

But I would like to keep certain attributes, such as src and href. I have almost no experience with regular expressions, so any help would be really appreciated.

[maybe] Relevant update: This is part of a process of 'cleaning' posts in a database. I am iterating through all the posts, getting the HTML, cleaning it, and updating it in the corresponding table.

There are too many ways to malform HTML tags that trip up regex.
– Drakes, Commented Apr 11, 2015 at 4:31
That was the way that got closer to what I was looking for, but I am certainly open to better means.
– Multitut, Commented Apr 11, 2015 at 4:32
You may be interested in another topic discussed here: stackoverflow.com/questions/317053/…
– Droopy4096, Commented Apr 11, 2015 at 5:30

Drakes · Accepted Answer · 2015-04-11 05:40:56Z

7

You usually should not parse HTML using regular expressions. Instead, in PHP you should call DOMDocument::loadHTML. You can then recurse through the elements in the document and call removeAttribute. Regular expressions for HTML tags are notoriously tricky.
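To illustrate why the regex approach is fragile, here is a small sketch that runs the pattern from the question against an attribute value containing a `>` character. The input string is a made-up example, not taken from the question.

```php
<?php
// The regex from the question, applied to an attribute value that contains ">".
// The input below is a hypothetical example chosen to trigger the problem.
$html_string = '<a title="1 > 0" href="x.html">link</a>';
$output = preg_replace("/<([a-z][a-z0-9]*)[^>]*?(\/?)>/i", '<$1$2>', $html_string);
echo $output;
// <a> 0" href="x.html">link</a>
```

The pattern treats the first `>` it finds as the end of the tag, so the rest of the tag leaks into the text. A DOM parser does not have this problem because it understands quoted attribute values.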

REF: http://php.net/manual/en/domdocument.loadhtml.php

Examples: http://coursesweb.net/php-mysql/html-attributes-php

Here's a solution for you. It will iterate over all tags in the DOM, and remove attributes which are not src or href.

$html_string = "<div class=\"myClass\"><b>This</b> is an <span style=\"margin:20px\">example</span><img src=\"ima.jpg\" /></div>";

$dom = new DOMDocument;                 // init new DOMDocument
$dom->loadHTML($html_string);           // load the HTML
$xpath = new DOMXPath($dom);
$nodes = $xpath->query('//@*');
foreach ($nodes as $node) {             
    if($node->nodeName != "src" && $node->nodeName != "href") {
        $node->parentNode->removeAttribute($node->nodeName);
    }
}

echo $dom->saveHTML();                  // output cleaned HTML

Here is another solution that uses XPath to filter on attribute names, so only the attributes to be removed are returned:

$dom = new DOMDocument;                 // init new DOMDocument
$dom->loadHTML($html_string);           // load the HTML
$xpath = new DOMXPath($dom);
$nodes = $xpath->query("//@*[local-name() != 'src' and local-name() != 'href']");
foreach ($nodes as $node) {             
    $node->parentNode->removeAttribute($node->nodeName);
}

echo $dom->saveHTML();                  // output cleaned HTML

Tip: If your HTML contains extended (non-ASCII) characters, convert it before loading so the DOM parser treats it as UTF-8:

$dom->loadHTML(mb_convert_encoding($html_string, 'HTML-ENTITIES', 'UTF-8'));
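Note that passing 'HTML-ENTITIES' to mb_convert_encoding is deprecated as of PHP 8.2. One alternative, mentioned in the comments on this answer, is to prepend an XML encoding declaration to the input. The following is a minimal sketch of that approach; the sample HTML is made up for illustration, and the declaration may show up in the serialized output, so you might need to strip it afterwards.

```php
<?php
// Hypothetical sample input containing non-ASCII characters.
$html_string = '<p class="intro">Café <a href="menu.html" title="Menü">menu</a></p>';

$dom = new DOMDocument;
// The XML declaration hints to the parser that the input is UTF-8.
$dom->loadHTML('<?xml encoding="utf-8" ?>' . $html_string);

$xpath = new DOMXPath($dom);
// Select every attribute that is not src or href, then remove it.
$nodes = $xpath->query("//@*[local-name() != 'src' and local-name() != 'href']");
foreach ($nodes as $node) {
    $node->parentNode->removeAttribute($node->nodeName);
}

$cleaned = $dom->saveHTML();
echo $cleaned;
```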

edited Apr 11, 2015 at 5:40

answered Apr 11, 2015 at 4:28

Drakes


5 Comments

Casimir et Hippolyte Over a year ago

Instead of checking the attribute name in an if statement you can do it inside the xpath query (you will save a lot of iterations).

Multitut Over a year ago

Thanks, this worked pretty well. I only prepended a charset because the result had some weird character formatting: $dom->loadHTML('<?xml encoding="utf-8" ?>' . $f->description);

Drakes Over a year ago

@CasimiretHippolyte Challenge, accepted! Multitut, thanks for that, I'll update my answer for future readers

stack Over a year ago

Upvoted. However, your code doesn't work correctly. See this please, and focus on the <p> tag. It doesn't end up where expected. Can you please fix your code?

Drakes Over a year ago

@stack. You have two choices: Either wrap all your HTML in one container (i.e. have a single root node), or allow the DOMDocument to do this for you by removing LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD as options. I hope this helps you.
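The first choice (a single root node) could look roughly like the sketch below. The function name clean_fragment, its $keep parameter, and the sample input are hypothetical and only serve the illustration; encoding handling is left out to keep it short.

```php
<?php
/**
 * Hypothetical helper: removes every attribute not listed in $keep
 * from an HTML fragment and returns the cleaned fragment.
 */
function clean_fragment(string $html, array $keep = ['src', 'href']): string
{
    $dom = new DOMDocument;
    // Wrap the fragment in one container so the document has a single root node.
    $dom->loadHTML(
        '<div>' . $html . '</div>',
        LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD
    );

    $xpath = new DOMXPath($dom);
    foreach ($xpath->query('//@*') as $attr) {
        if (!in_array($attr->nodeName, $keep, true)) {
            $attr->parentNode->removeAttribute($attr->nodeName);
        }
    }

    // Serialize only the children of the wrapper, not the wrapper itself.
    $out = '';
    foreach ($dom->documentElement->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

echo clean_fragment('<p class="a">One</p><p id="b">Two <a href="x.html" target="_blank">link</a></p>');
```

Serializing the wrapper's children one by one keeps the temporary div out of the result, so the cleaned fragment can be written back to the database as is.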
