I found a way to remove all tag attributes from an HTML string using PHP:

$html_string = "<div class='myClass'><b>This</b> is an <span style='margin:20px'>example</span><img src='ima.jpg' /></div>";
$output = preg_replace("/<([a-z][a-z0-9]*)[^>]*?(\/?)>/i",'<$1$2>', $html_string);
echo $output;
//<div><b>This</b> is an <span>example</span><img/></div>

But I would like to keep certain attributes, such as src and href. I have almost no experience with regular expressions, so any help would be really appreciated.

[maybe] Relevant update: This is part of a process of 'cleaning' posts in a database. I am iterating through all the posts, getting the HTML, cleaning it, and updating it in the corresponding table.

5 Comments
  • an example would be better. Commented Apr 11, 2015 at 4:26
  • What's wrong with html parsers? Why you prefer regex? Commented Apr 11, 2015 at 4:31
  • There are too many ways to malform html tags that trip up regex Commented Apr 11, 2015 at 4:31
  • That was the way that got closer to what I was looking for, but I am sure open to better means. Commented Apr 11, 2015 at 4:32
  • You may be interested in another topic discussed here: stackoverflow.com/questions/317053/… Commented Apr 11, 2015 at 5:30

1 Answer

As a rule, you should not parse HTML using regular expressions: attribute values can contain ">" or mixed quoting, and malformed markup easily trips up a pattern. Instead, in PHP you should call DOMDocument::loadHTML. You can then walk through the elements in the document and call removeAttribute on the attributes you don't want.

REF: http://php.net/manual/en/domdocument.loadhtml.php

Examples: http://coursesweb.net/php-mysql/html-attributes-php

Here's a solution for you. It will iterate over all tags in the DOM, and remove attributes which are not src or href.

$html_string = "<div class=\"myClass\"><b>This</b> is an <span style=\"margin:20px\">example</span><img src=\"ima.jpg\" /></div>";

$dom = new DOMDocument;                 // init new DOMDocument
$dom->loadHTML($html_string);           // load the HTML
$xpath = new DOMXPath($dom);
$nodes = $xpath->query('//@*');
foreach ($nodes as $node) {             
    if($node->nodeName != "src" && $node->nodeName != "href") {
        $node->parentNode->removeAttribute($node->nodeName);
    }
}

echo $dom->saveHTML();                  // output cleaned HTML

Here is another solution that filters on attribute names inside the XPath query instead, so the loop only visits the attributes that need removing:

$dom = new DOMDocument;                 // init new DOMDocument
$dom->loadHTML($html_string);           // load the HTML
$xpath = new DOMXPath($dom);
$nodes = $xpath->query("//@*[local-name() != 'src' and local-name() != 'href']");
foreach ($nodes as $node) {             
    $node->parentNode->removeAttribute($node->nodeName);
}

echo $dom->saveHTML();                  // output cleaned HTML
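
Both snippets above print the whole document, including the doctype and the <html><body> wrapper that loadHTML() adds around a fragment. For the database clean-up described in the question, it may be more convenient to get back only the cleaned fragment. Below is a minimal sketch under that assumption; the function name strip_attributes and its $keep parameter are hypothetical, not part of the original answer.

```php
<?php
// Hypothetical helper: remove every attribute whose name is not in $keep
// and return only the cleaned fragment (no doctype, <html> or <body>).
function strip_attributes(string $html, array $keep = ['src', 'href']): string
{
    // loadHTML() rejects an empty string, so return early.
    if (trim($html) === '') {
        return '';
    }

    $dom = new DOMDocument();

    // Collect libxml parse warnings internally instead of emitting
    // PHP warnings for imperfect real-world markup.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Same idea as the first solution: visit every attribute node
    // and drop the ones that are not on the allowlist.
    $xpath = new DOMXPath($dom);
    foreach ($xpath->query('//@*') as $attr) {
        if (!in_array(strtolower($attr->nodeName), $keep, true)) {
            $attr->ownerElement->removeAttributeNode($attr);
        }
    }

    // Serialize only the children of <body>, so the wrapper that
    // loadHTML() added is not written back to the database.
    $body = $dom->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }
    $out = '';
    foreach ($body->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

$clean = strip_attributes('<div class="myClass"><b>This</b> is an <span style="margin:20px">example</span><img src="ima.jpg" /></div>');
echo $clean, PHP_EOL; // class and style are gone, src is kept
```

Passing the allowlist as a parameter keeps the function reusable, e.g. strip_attributes($html, ['src', 'href', 'alt']). Note that this sketch does not deal with character encoding; see the tip below for that.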

Tip: If your HTML contains non-ASCII (extended) characters, tell the DOM parser the input is UTF-8, for example by converting it to HTML entities first:

$dom->loadHTML(mb_convert_encoding($html_string, 'HTML-ENTITIES', 'UTF-8'));
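
Note that passing 'HTML-ENTITIES' to mb_convert_encoding is deprecated as of PHP 8.2. One alternative that avoids the conversion is to prepend a charset hint to the markup before loading it, similar in spirit to the XML declaration prefix mentioned in the comments. The sketch below assumes the input is UTF-8 and uses a meta tag as the hint; the sample markup is made up for illustration.

```php
<?php
// Made-up sample input containing non-ASCII characters (UTF-8 encoded).
$html_string = '<p class="intro">Café <a href="/menu" title="Menú">menú</a></p>';

$dom = new DOMDocument();

// Collect libxml parse warnings internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

// The leading <meta> tag tells the HTML parser to read the bytes as UTF-8.
// It ends up in <head>, so it does not affect the <body> content.
$dom->loadHTML(
    '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">' . $html_string
);

libxml_clear_errors();
libxml_use_internal_errors($previous);

// Text read back from the DOM keeps its accented characters.
$text = $dom->getElementsByTagName('a')->item(0)->textContent;
echo $text, PHP_EOL;
```

If the cleaned markup is written back out, read it from the <body> children rather than the whole document so the hint itself is not stored.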

5 Comments

Instead of checking the attribute name in an if statement you can do it inside the xpath query (you will save a lot of iterations).
Thanks, this worked pretty well. I only prepended a charset because the result had some weird character formatting: $dom->loadHTML('<?xml encoding="utf-8" ?>' . $f->description);
@CasimiretHippolyte Challenge, accepted! Multitut, thanks for that, I'll update my answer for future readers
Upvoted. However, your code doesn't work correctly. See this please, and focus on the <p> tag: it doesn't end up where expected. Can you please fix your code?
@stack. You have two choices: Either wrap all your HTML in one container (i.e. have a single root node), or allow the DOMDocument to do this for you by removing LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD as options. I hope this helps you.
