I need to remove every <script> element, including any CDATA content inside it, from an HTML string.

I'm using code like this:

$content = "
TEST1
<script type='text/javascript'>
/* <![CDATA[ */
var markers = [{'ID':3681,'post_author':'4'}]
/* ]]> */
</script>
TEST2
";

libxml_use_internal_errors(true);
$domDoc = new DOMDocument();
$domDoc->loadHTML($content);
libxml_clear_errors();

foreach($domDoc->getElementsByTagName('script') as $scripttag){
    $scripttag->parentNode->removeChild($scripttag);
}

But it doesn't work: nothing is removed from $content.

It does work if I use a regular expression like this:

$re = '/<script\b[^>]*>.*?<\/script>/is';
$str = 'TEST1
<script type=\'text/javascript\'>
/* <![CDATA[ */
var markers = [{\'ID\':3681,\'post_author\':\'4\'}]
/* ]]> */
</script>
TEST2';

$content= preg_replace($re, '', $str, 1);

Is it possible to remove this kind of content with PHP's DOMDocument instead of a regular expression?

EDIT, after Hatef's answer:

$content = "
<script type='text/javascript'>
/* <![CDATA[ */
var _cf7 = {'recaptcha':{'messages':{'empty':'Merci de confirmer que vous n\u2019\u00eates pas un robot.'}},'cached':'1'};
/* ]]> */
</script>
<script type='text/javascript' src='https://www.test.com/includes/js/scripts.js'></script>
<script type='text/javascript'>
/* <![CDATA[ */
var pollsL10n = {'ajax_url':'https:\/\/www.test.com\/ajax.php','text_wait':'Your last request is still being processed. Please wait a while ...','text_valid':'Please choose a valid poll answer.','text_multiple':'Maximum number of choices allowed:','show_loading':'1','show_fading':'1'};
/* ]]> */
</script>
<!--[if lt IE 8]>
<script type='text/javascript' src='https://www..test.com/json2.min.js'></script>
<![endif]--><script type='text/javascript'>
/* <![CDATA[ */
var ajaxurl = 'https:\/\/.test.com\/ajax.php';
/* ]]> */
</script>
<script type='text/javascript' src='https://www.test.com/slider.min.js?x40297'></script>
<script>
        (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
        (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
        m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
        })(window,document,'script','https://www.google-analytics.com/analytics.js','ga');

        ga('create', 'UA-37273722-1', 'auto');
        ga('send', 'pageview');
</script>
";

libxml_use_internal_errors(true);
$domDoc = new DOMDocument();
$domDoc->loadHTML($content);
libxml_clear_errors();

foreach($domDoc->getElementsByTagName('script') as $scripttag){
    $scripttag->parentNode->removeChild($scripttag);
}
$content = $domDoc->saveHTML();

$content now contains:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><head><script type="text/javascript" src="https://www.test.com/includes/js/scripts.js"></script><!--[if lt IE 8]>
<script type='text/javascript' src='https://www..test.com/json2.min.js'></script>
<![endif]--><script type="text/javascript">
/* <![CDATA[ */
var ajaxurl = 'https:\/\/.test.com\/ajax.php';
/* ]]> */
</script><script>
        (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
        (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
        m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
        })(window,document,'script','https://www.google-analytics.com/analytics.js','ga');

        ga('create', 'UA-37273722-1', 'auto');
        ga('send', 'pageview');
</script></head></html>

1 Answer

Your DOMDocument code does remove the tag from the document; you are just missing the last line, which saves the modified HTML back into $content:

$content = $domDoc->saveHTML();

As you may already know, it's better not to use regex to parse HTML.


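This one should work with your new example. getElementsByTagName() returns a live list, so removing nodes inside a foreach shifts the remaining items and every other tag is skipped. Removing the first item until the list is empty avoids that: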

$scriptTags = $domDoc->getElementsByTagName('script');

while($scriptTags->length > 0){
    $scriptTag = $scriptTags->item(0);
    $scriptTag->parentNode->removeChild($scriptTag);
}
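
For completeness, here is a minimal, self-contained sketch that combines the steps above: load, remove, then save. The function name removeScriptTags and the sample markup are only assumptions for illustration, not part of the original code. Instead of the while loop, it copies the live node list into a plain array with iterator_to_array(), which is another common way to keep removals from shifting the list during iteration.

```php
<?php
// Hypothetical helper (name chosen for illustration): strips every <script> element.
function removeScriptTags(string $html): string
{
    libxml_use_internal_errors(true);   // keep libxml quiet about sloppy markup
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    // Snapshot the live DOMNodeList into an array so that removing a node
    // does not change the collection being iterated.
    $scripts = iterator_to_array($doc->getElementsByTagName('script'));
    foreach ($scripts as $script) {
        $script->parentNode->removeChild($script);
    }

    // Without this call the changes stay inside $doc and never reach the string.
    return $doc->saveHTML();
}

// Made-up sample input with three consecutive script tags.
$html = "<p>TEST1</p><script>var a = 1;</script>"
      . "<script src='x.js'></script><script>var b = 2;</script><p>TEST2</p>";
$clean = removeScriptTags($html);
echo $clean;
```

Note that saveHTML() returns a full document, so the result is wrapped in a DOCTYPE and <html>/<body> tags, as in the output shown in the question.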

2 Comments

Yes, it works with my example. But when I load maisons-qualite.com/le-reseau-mdq/… with cURL, not all <script> tags are removed. Strange.
Is it possible that the HTML on your page is not well formed? Please add a relevant example that actually fails.
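
One possible cause, judging only from the output posted in the question: a <script> tag wrapped in a conditional comment such as <!--[if lt IE 8]> ... <![endif]--> is parsed as a comment node, not as an element, so getElementsByTagName('script') never sees it. Whether this is what happens on the real page is an assumption. The sketch below, using a hypothetical helper name and made-up sample markup, also drops comment nodes whose text contains a script tag.

```php
<?php
// Hypothetical helper: removes <script> elements and any comment that wraps one.
function removeScriptsAndScriptComments(string $html): string
{
    libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    // Regular <script> elements (array snapshot avoids live-list skipping).
    foreach (iterator_to_array($doc->getElementsByTagName('script')) as $script) {
        $script->parentNode->removeChild($script);
    }

    // Comment nodes, e.g. IE conditional comments, that contain a script tag.
    $xpath = new DOMXPath($doc);
    foreach (iterator_to_array($xpath->query('//comment()')) as $comment) {
        if (stripos($comment->nodeValue, '<script') !== false) {
            $comment->parentNode->removeChild($comment);
        }
    }

    return $doc->saveHTML();
}

// Made-up sample: one script hidden in a conditional comment, one regular script.
$html = "<p>A</p><!--[if lt IE 8]><script src='json2.min.js'></script><![endif]-->"
      . "<script>var x = 1;</script><p>B</p>";
$clean = removeScriptsAndScriptComments($html);
echo $clean;
```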
