How to remove CDATA blocks inside a script element?

Question

In an HTML file, I want to use PHP to remove the CDATA blocks inside the script elements.

<script type="text/javascript">
    /* <![CDATA[ */
    var A=new Array();
    ..........................
    ..........................
/* ]]> */
</script>
some text2 ........................
some text3 ........................
some text4 ........................
<script type="text/javascript">
    /* <![CDATA[ */
    var B=new Array();
    ..........................
    ..........................
/* ]]> */
</script>
some text5 ........................

I haven't found how to select and remove these nodes with XPath and PHP DOMDocument.

I tried this regular expression: $re = '/\/\*\s*<!\[CDATA\[[\s\S]*\/\*\s*\]\]>\s*\*\//i';

But it removes everything from the first CDATA opening to the last CDATA closing, including the text between the two CDATA blocks.

As a result, I get an empty string instead of:

some text2 ........................ 
some text3 ........................ 
some text4 ........................ 
some text5 ........................

Any ideas?

Update, using ThW's solution:

With this page, it seems that the text of the CDATA section is not parsed correctly:

libxml_use_internal_errors(true);
$domDoc = new DOMDocument();
$domDoc->loadHTMLFile('https://www.maisons-qualite.com/le-reseau-mdq/recherche-constructeurs-agrees/construction-maison-neuve-centre-val-loire');
libxml_clear_errors();

$xpath = new DOMXpath($domDoc);
foreach($xpath->evaluate('//text()') as $section) {
  if ($section instanceof DOMCDATASection) {
    print_r($section->textContent);
    $section->parentNode->removeChild($section);
  }
}
$content = $domDoc->saveHTML();

I got this truncated textContent:

.....
.....
function updateConstructeurs(list) {
    for (var i in list) {
        if(list[i]['thumbnail']) {
            jQuery('#reseau-constructeurs').append('<div class="reseau-constructeur">' +
                '<div class="img" style="background-image:url(' + list[i]['thumbnail'] + ')">

instead of the full script source:

function updateConstructeurs(list) {
    for (var i in list) {
        if(list[i]['thumbnail']) {
            jQuery('#reseau-constructeurs').append('<div class="reseau-constructeur">' +
                '<div class="img" style="background-image:url(' + list[i]['thumbnail'] + ')"></div>' +
                '<h3>' + list[i]['title'] + '</h3>' +
                '<a class="btn purple" href="' + list[i]['link'] + '">Accéder à la fiche</a>' +
            '</div>');
        }
    }
}

As a result, instead of an empty string where the script content was, the rest of the script is left behind in the output:

                        '<h3>' + list[i]['title'] + '</h3>' +
                        '<a class="btn purple" href="'%20+%20list%5Bi%5D%5B'link'%5D%20+%20'">Acc&eacute;der &agrave; la fiche</a>' +
                    '</div>');
                }
            }
        }
    /* ]]&gt; */

Dmitry Egorov · Accepted Answer · 2017-05-03 12:42:46Z

1

Make the [\s\S]* non-greedy, i.e. [\s\S]*?:

\/\*\s*<!\[CDATA\[[\s\S]*?\/\*\s*\]\]>\s*\*\/

Demo: https://regex101.com/r/AutLW9/1
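To illustrate why the quantifier matters, here is a minimal sketch comparing the greedy and non-greedy patterns with preg_replace. The sample string and the variable names ($greedy, $lazy, $greedyResult, $lazyResult) are made up for this illustration and are not taken from the question's real page.

```php
<?php
// Two commented CDATA blocks with ordinary text between them (made-up sample data).
$html = "<script>/* <![CDATA[ */ var A = 1; /* ]]> */</script>\n"
      . "some text2\n"
      . "<script>/* <![CDATA[ */ var B = 2; /* ]]> */</script>\n"
      . "some text5";

// Greedy: [\s\S]* runs to the LAST closing marker in the string.
$greedy = '/\/\*\s*<!\[CDATA\[[\s\S]*\/\*\s*\]\]>\s*\*\//';
// Non-greedy: [\s\S]*? stops at the FIRST closing marker after each opening marker.
$lazy   = '/\/\*\s*<!\[CDATA\[[\s\S]*?\/\*\s*\]\]>\s*\*\//';

$greedyResult = preg_replace($greedy, '', $html);
$lazyResult   = preg_replace($lazy, '', $html);

echo $greedyResult, "\n---\n", $lazyResult, "\n";
```

With the greedy pattern, a single match spans both blocks, so the text between them is removed as well. With the non-greedy pattern, each block is matched and removed separately.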


2 Comments

LeMoussel Over a year ago

It doesn't seem to work. The demo displays "processing..." with no result.

LeMoussel Over a year ago

Same error in the demo, but it works in PHP. I have posted your solution as a PHP answer.

LeMoussel · Accepted Answer · 2017-05-03 13:06:06Z

0

Dmitry Egorov's solution in PHP:

$re = '/\/\*\s*<!\[CDATA\[[\s\S]*?\/\*\s*\]\]>\s*\*\//';
$str = '<script type="text/javascript">
    /* <![CDATA[ */
    var A=new Array();
    ..........................
    ..........................
/* ]]> */
</script>
some text2 ........................
some text3 ........................
some text4 ........................
<script type="text/javascript">
    /* <![CDATA[ */
    var B=new Array();
    ..........................
    ..........................
/* ]]> */
</script>
some text5 ........................';
$subst = '';

$result = preg_replace($re, $subst, $str);

echo "The result of the substitution is ".$result;


ThW · Accepted Answer · 2017-05-04 07:09:07Z

0

CDATA sections are a kind of character data node, like text nodes. For most purposes you handle them the same way; the difference is in the serialization. So fetch the nodes using XPath and remove those that are CDATA sections (and not text nodes):

$document = new DOMDocument();
$document->loadHtml($html);
$xpath = new DOMXpath($document);

foreach($xpath->evaluate('//text()') as $section) {
  if ($section instanceof DOMCDATASection) {
    $section->parentNode->removeChild($section);
  }
}

echo $document->saveHtml();
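As a self-contained check of the node-type test above, here is a small sketch that runs the same loop on an XML string, where CDATA sections are unambiguous. Using loadXML() instead of loadHTML(), the sample markup, and the $removed counter are assumptions made for this illustration only. It needs PHP's dom extension.

```php
<?php
// Made-up XML sample containing one CDATA section and one ordinary text node.
$xml = '<root><script><![CDATA[ var A = 1; ]]></script><p>some text2</p></root>';

$document = new DOMDocument();
$document->loadXML($xml); // assumption: XML input, so the CDATA section is kept as its own node type
$xpath = new DOMXPath($document);

$removed = 0;
// text() selects both plain text nodes and CDATA sections; the class check tells them apart.
foreach ($xpath->evaluate('//text()') as $node) {
    if ($node instanceof DOMCdataSection) {
        $node->parentNode->removeChild($node);
        $removed++;
    }
}

// Serialize only the root element to skip the XML declaration.
$result = $document->saveXML($document->documentElement);
echo $result, "\n";
```

The text node inside the p element is left alone because it is a DOMText instance that is not a DOMCdataSection.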

However, you might want to rethink that. Is it really important to have no CDATA sections? You might instead want to remove the content of the script elements. This is even shorter:

$document = new DOMDocument();
$document->loadHtml($html);
$xpath = new DOMXpath($document);

foreach($xpath->evaluate('//script/node()') as $node) {
  $node->parentNode->removeChild($node);
}

echo $document->saveHtml();

//script/node() matches any child node inside a script element, be it a CDATA section, a text node, or anything else.
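The second approach could be wrapped in a small function, sketched below. The function name stripScriptContent and the sample HTML are hypothetical. The libxml error handling follows the pattern from the question's update rather than anything this answer requires.

```php
<?php
// Hypothetical helper: empties every <script> element in an HTML string.
function stripScriptContent(string $html): string
{
    // Collect parser warnings internally instead of printing them, then restore the old setting.
    $previous = libxml_use_internal_errors(true);
    $document = new DOMDocument();
    $document->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($document);
    // Remove every child node of every script element, whatever its node type.
    foreach ($xpath->evaluate('//script/node()') as $node) {
        $node->parentNode->removeChild($node);
    }

    return $document->saveHTML();
}

// Made-up sample input.
$html = '<html><body>'
      . '<script type="text/javascript">/* <![CDATA[ */ var A = 1; /* ]]> */</script>'
      . '<p>some text2</p>'
      . '</body></html>';

$result = stripScriptContent($html);
echo $result;
```

The script elements themselves stay in the output, only emptied. To drop them entirely, the XPath expression could select //script instead.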


1 Comment

LeMoussel Over a year ago

Good solution that doesn't use a regular expression. But I ran into a bug; I have updated my post with it.
