Using PHP, I want to remove the CDATA blocks inside the script elements of an HTML file.

<script type="text/javascript">
    /* <![CDATA[ */
    var A=new Array();
    ..........................
    ..........................
/* ]]> */
</script>
some text2 ........................
some text3 ........................
some text4 ........................
<script type="text/javascript">
    /* <![CDATA[ */
    var B=new Array();
    ..........................
    ..........................
/* ]]> */
</script>
some text5 ........................

I haven't found a way to select and remove these nodes with XPath and PHP's DOMDocument.

I tried this regular expression:

$re = '/\/\*\s*<!\[CDATA\[[\s\S]*\/\*\s*\]\]>\s*\*\//i';

But this removes everything from the first CDATA opening to the last CDATA closing, including the text between the two blocks.

As a result I get an empty string instead of:

some text2 ........................ 
some text3 ........................ 
some text4 ........................ 
some text5 ........................

Any ideas?

Update with ThW's solution:

With this page, it seems that the text of the CDATA section is not parsed correctly:

libxml_use_internal_errors(true);
$domDoc = new DOMDocument();
$domDoc->loadHTMLFile('https://www.maisons-qualite.com/le-reseau-mdq/recherche-constructeurs-agrees/construction-maison-neuve-centre-val-loire');
libxml_clear_errors();

$xpath = new DOMXpath($domDoc);
foreach($xpath->evaluate('//text()') as $section) {
  if ($section instanceof DOMCDATASection) {
    print_r($section->textContent);
    $section->parentNode->removeChild($section);
  }
}
$content = $domDoc->saveHTML();

I got this textContent:

.....
.....
function updateConstructeurs(list) {
    for (var i in list) {
        if(list[i]['thumbnail']) {
            jQuery('#reseau-constructeurs').append('<div class="reseau-constructeur">' +
                '<div class="img" style="background-image:url(' + list[i]['thumbnail'] + ')">

for this original script:

function updateConstructeurs(list) {
    for (var i in list) {
        if(list[i]['thumbnail']) {
            jQuery('#reseau-constructeurs').append('<div class="reseau-constructeur">' +
                '<div class="img" style="background-image:url(' + list[i]['thumbnail'] + ')"></div>' +
                '<h3>' + list[i]['title'] + '</h3>' +
                '<a class="btn purple" href="' + list[i]['link'] + '">Accéder à la fiche</a>' +
            '</div>');
        }
    }
}

And as a result, instead of an empty string, we get:

                        '<h3>' + list[i]['title'] + '</h3>' +
                        '<a class="btn purple" href="'%20+%20list%5Bi%5D%5B'link'%5D%20+%20'">Acc&eacute;der &agrave; la fiche</a>' +
                    '</div>');
                }
            }
        }
    /* ]]&gt; */

3 Answers

Make the [\s\S]* non-greedy, i.e. [\s\S]*?, so that the match stops at the first closing /* ]]> */ instead of the last one:

\/\*\s*<!\[CDATA\[[\s\S]*?\/\*\s*\]\]>\s*\*\/

Demo: https://regex101.com/r/AutLW9/1
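
To see the difference outside the demo, here is a minimal sketch that counts matches for both patterns. The $html sample is a shortened, hypothetical stand-in for the page in the question, and the variable names are only illustrative:

```php
<?php
// Hypothetical sample input: two script blocks with plain text in between.
$html = <<<'HTML'
<script>/* <![CDATA[ */ var A = 1; /* ]]> */</script>
some text2
<script>/* <![CDATA[ */ var B = 2; /* ]]> */</script>
some text5
HTML;

// Greedy version from the question: [\s\S]* runs to the LAST closing marker.
$greedy = '/\/\*\s*<!\[CDATA\[[\s\S]*\/\*\s*\]\]>\s*\*\//';
// Non-greedy version: [\s\S]*? stops at the FIRST closing marker.
$lazy   = '/\/\*\s*<!\[CDATA\[[\s\S]*?\/\*\s*\]\]>\s*\*\//';

$greedyCount = preg_match_all($greedy, $html); // one match spanning both blocks
$lazyCount   = preg_match_all($lazy, $html);   // one match per CDATA block

// Strip each CDATA block separately; the text between the blocks is kept.
$cleaned = preg_replace($lazy, '', $html);

echo $greedyCount, ' ', $lazyCount, PHP_EOL;
echo $cleaned, PHP_EOL;
```

With the greedy pattern the single match swallows "some text2" as well, which is the behaviour described in the question. The non-greedy pattern leaves the text between the blocks untouched.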

2 Comments

This doesn't seem to work: the demo displays "processing..." with no result.
Same error in the demo, but it works in PHP. I'll post your solution in PHP.

Dmitry Egorov's solution in PHP:

$re = '/\/\*\s*<!\[CDATA\[[\s\S]*?\/\*\s*\]\]>\s*\*\//';
$str = '<script type="text/javascript">
    /* <![CDATA[ */
    var A=new Array();
    ..........................
    ..........................
/* ]]> */
</script>
some text2 ........................
some text3 ........................
some text4 ........................
<script type="text/javascript">
    /* <![CDATA[ */
    var B=new Array();
    ..........................
    ..........................
/* ]]> */
</script>
some text5 ........................';
$subst = '';

$result = preg_replace($re, $subst, $str);

echo "The result of the substitution is ".$result;

CDATA sections are a type of character node, like text nodes. For most purposes you handle them the same way; the difference is in the serialization. So fetch the nodes using XPath and remove them if they are CDATA sections (and not text nodes):

$document = new DOMDocument();
$document->loadHtml($html);
$xpath = new DOMXpath($document);

foreach($xpath->evaluate('//text()') as $section) {
  if ($section instanceof DOMCDATASection) {
    $section->parentNode->removeChild($section);
  }
}

echo $document->saveHtml();

However, you might want to rethink that. Is it really important to have no CDATA sections? You might instead want to remove the whole content of the script elements, which is even shorter:

$document = new DOMDocument();
$document->loadHtml($html);
$xpath = new DOMXpath($document);

foreach($xpath->evaluate('//script/node()') as $node) {
  $node->parentNode->removeChild($node);
}

echo $document->saveHtml();

//script/node() matches any child node inside a script element, be it a CDATA section, a text node or anything else.
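
As a self-contained sketch of this second variant, assuming a small inline document instead of the real page (the $html string and the $remaining check are hypothetical additions for illustration), the loop can be tried like this:

```php
<?php
// Hypothetical sample input, not the page from the question.
$html = '<html><body>'
      . '<script type="text/javascript">/* <![CDATA[ */ var A = []; /* ]]> */</script>'
      . '<p>some text2</p>'
      . '<script type="text/javascript">/* <![CDATA[ */ var B = []; /* ]]> */</script>'
      . '<p>some text5</p>'
      . '</body></html>';

libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$document = new DOMDocument();
$document->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($document);

// Remove every child node of every script element.
// The XPath result is a static list, so removing nodes while iterating is safe.
foreach ($xpath->evaluate('//script/node()') as $node) {
    $node->parentNode->removeChild($node);
}

// Count what is left inside the script elements (XPath count() yields a float).
$remaining = $xpath->evaluate('count(//script/node())');

$output = $document->saveHTML();
echo $output;
```

The script elements themselves stay in the document, only emptied. If they should disappear as well, selecting //script and removing those nodes would be the analogous change.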

1 Comment

Good solution without regular expressions, but I ran into a bug. I've updated my post with it.
