
Using PHP, I want to remove all HTML attributes except:

the "src" attribute of "img" tags

and

the "href" attribute of "a" tags.

My input is an .html file that was converted from .doc and .docx.

The output should again be an HTML file, with all other attributes removed.

Please help.

Edit:

After trying Alexander's script as shown below, I open strip.html in a code editor and don't see any changes:

<?php
$path = '/var/www/strip.html';
$html = file_get_contents($path);
$dom = new DOMDocument();
$dom->strictErrorChecking = false;
$dom->formatOutput = true;
$dom->loadHTML($html);

$xpath = new DOMXPath($dom);
if (false === ($elements = $xpath->query("//img"))) die('Error');

foreach ($elements as $element) {
    for ($i = $element->attributes->length; --$i >= 0;) {
        $name = $element->attributes->item($i)->name;
        if ('src' !== $name) {
            $element->removeAttribute($name);
        }
    }
}

if (false === ($elements = $xpath->query("//a"))) die('Error');

foreach ($elements as $element) {
    for ($i = $element->attributes->length; --$i >= 0;) {
        $name = $element->attributes->item($i)->name;
        if ('href' !== $name) {
            $element->removeAttribute($name);
        }
    }
}

$dom->saveHTMLFile($path);

?>
  • stackoverflow.com/questions/2994448/… Commented Apr 16, 2014 at 13:14
  • @stefan How can I make it work so that I input the HTML, click a button, and am asked to save the processed HTML file? Commented Apr 16, 2014 at 13:21
  • That link should help you get started, I'm not going to architect your app for you but after you get your html, however that be, pass it through the regex(es). Commented Apr 16, 2014 at 13:27

1 Answer

Use the DOMDocument class to parse the HTML. The following processes only the "a" and "img" tags:

$path = '/path/to/file.html';
$html = file_get_contents($path);
$dom = new DOMDocument();
//$dom->strictErrorChecking = false;
$dom->formatOutput = true;
$dom->loadHTML($html);

$xpath = new DOMXPath($dom);

// Keep only "src" on every <img> element.
if (false === ($elements = $xpath->query("//img"))) die('Error');

foreach ($elements as $element) {
    // Walk the attribute list backwards: removing an item shrinks the list.
    for ($i = $element->attributes->length; --$i >= 0;) {
        $name = $element->attributes->item($i)->name;
        if ('src' !== $name) {
            $element->removeAttribute($name);
        }
    }
}

// Keep only "href" on every <a> element.
if (false === ($elements = $xpath->query("//a"))) die('Error');

foreach ($elements as $element) {
    for ($i = $element->attributes->length; --$i >= 0;) {
        $name = $element->attributes->item($i)->name;
        if ('href' !== $name) {
            $element->removeAttribute($name);
        }
    }
}

// Overwrite the input file with the cleaned markup.
$dom->saveHTMLFile($path);

Also, read why you can't parse [X]HTML with regex and take a look at useful xpath links.

Update: to process all tags, removing every attribute except "href" on "a" and "src" on "img":

$path = '/path/to/file.html';
$html = file_get_contents($path);
$dom = new DOMDocument();
//$dom->strictErrorChecking = false;
$dom->formatOutput = true;
$dom->loadHTML($html);

$xpath = new DOMXPath($dom);
if (false === ($elements = $xpath->query("//*"))) die('Error');

foreach ($elements as $element) {
    for ($i = $element->attributes->length; --$i >= 0;) {
        $name = $element->attributes->item($i)->name;
        if (('img' === $element->nodeName && 'src' === $name)
            || ('a' === $element->nodeName && 'href' === $name)
        ) {
            continue;
        }

        $element->removeAttribute($name);
    }
}

$dom->saveHTMLFile($path);
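
The same logic can also be wrapped in a function that works on strings, which makes it easier to check the result before touching any file. This is only a sketch: the function name stripAttributes and the $keep whitelist parameter are made up for illustration and are not part of DOMDocument. It also calls libxml_use_internal_errors(true) so that parser warnings, which Word-generated HTML may well trigger, are collected instead of printed.

```php
<?php
// Hypothetical helper (name and signature are assumptions, not a library API):
// removes every attribute except those whitelisted per tag name.
function stripAttributes(string $html, array $keep = ['img' => ['src'], 'a' => ['href']]): string
{
    // Collect libxml parser warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    foreach ($xpath->query('//*') as $element) {
        // Attributes allowed for this tag; none if the tag is not listed.
        $allowed = $keep[$element->nodeName] ?? [];
        // Walk backwards because removing an attribute shrinks the live list.
        for ($i = $element->attributes->length; --$i >= 0;) {
            $name = $element->attributes->item($i)->name;
            if (!in_array($name, $allowed, true)) {
                $element->removeAttribute($name);
            }
        }
    }

    return $dom->saveHTML();
}

// Usage with a small inline sample instead of a file.
$out = stripAttributes('<p class="x"><a href="/a" style="c">link</a><img src="i.png" alt="z"></p>');
echo $out;
```

Note that saveHTML() serializes the whole document, so the returned string includes the doctype and the html/body wrapper that loadHTML() adds around a fragment.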

16 Comments

This outputs the same file as the input. What I did: saved the code you gave as a PHP file, changed the $path value to the input file path, added a new variable $pathh for the output path, and changed $path to $pathh in the last line. I loaded the PHP file in the browser and got a file in the $pathh directory identical to the input; the attributes were not removed.
@PHPGeany You definitely did something wrong, because this code works. Proof: codepad link
@PHPGeany "Loaded the PHP file in the browser"? Do you have an HTTP server installed on your local machine? Did you try running this code from the console, e.g. "php codesource.php"? Browsers have no built-in PHP interpreter, which is why opening PHP code directly in a browser does nothing.
See the edit I made to my original question. I use a LAMP stack and ran it in a local browser as localhost/path.php
Do you have any errors or warnings in error_log? Try changing $dom->saveHTMLFile($path); to var_dump($dom->saveHTML()); to see whether the problem is in removing the attributes.
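
One possible way to follow that debugging suggestion is to write to a separate output file and check the return value of saveHTMLFile(), which is the number of bytes written or false on failure. The sketch below is an assumption about how such a check might look, not a diagnosis of the actual problem; the temp-file paths and the sample markup are made up so the script is self-contained.

```php
<?php
// Hypothetical paths, used only so this sketch runs anywhere.
$in  = sys_get_temp_dir() . '/strip-in.html';
$outPath = sys_get_temp_dir() . '/strip-out.html';
file_put_contents($in, '<p><img src="a.png" width="10"></p>');

// Collect parser warnings instead of printing them.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML(file_get_contents($in));

foreach ($dom->getElementsByTagName('img') as $img) {
    $img->removeAttribute('width');
}

// A web server user often lacks write access to some directories,
// so check the target directory before writing.
if (!is_writable(dirname($outPath))) {
    die('Output directory is not writable: ' . dirname($outPath));
}

// saveHTMLFile() returns the number of bytes written, or false on failure.
$bytes = $dom->saveHTMLFile($outPath);
if ($bytes === false) {
    die('Could not write ' . $outPath);
}

echo "Wrote $bytes bytes to $outPath\n";
```

If the script reports a successful write but the file still looks unchanged, it is worth confirming that the editor is showing the freshly written output file rather than the original input.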