I need to remove all <script> tags, together with their content, from a PHP string fetched with file_get_contents from an arbitrary website URL. I'm using this regular expression:
preg_replace('/<script\b[^>]*>(.*?)<\/script>/is', "", $string);
It works fine in general, but it fails when a script contains a CDATA section. An example string:
<script type='text/javascript'>
/* <![CDATA[ */
var variable = {"ajax":"....."}
/* ]]> */
</script>
I guess the problem is with those "/*" and "*/" comment markers.
I've already searched on Google and on Stack Overflow, but there is no question about that particular type of CDATA tag (wrapped in /* and */), so nothing I found works.
Any suggestions?
Edit:
Following Steve's answer, I am now using code like this:
foreach ($dom->getElementsByTagName('script') as $scripttag) {
    $scripttag->parentNode->removeChild($scripttag);
}
And then I have:
foreach ($dom->getElementsByTagName('ins') as $string) {
    $string2 .= $string->nodeValue;
    $string2 .= ' ';
}
But the resulting $string2 still has script content inside.
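One possible cause, stated as an assumption because the full code is not shown: getElementsByTagName returns a live DOMNodeList, so removing nodes while iterating over it can skip elements. A minimal sketch that copies the list into an array with iterator_to_array before removing anything; the sample markup is hypothetical:

```php
<?php
// Hypothetical sample markup; the real input would come from file_get_contents().
$html = '<html><body><ins><script>First JS</script></ins>'
      . '<ins>Hello</ins><script>Second JS</script></body></html>';

$dom = new DOMDocument();
libxml_use_internal_errors(true); // tolerate imperfect real-world HTML
$dom->loadHTML($html);
libxml_clear_errors();

// Snapshot the live node list so removals do not disturb the iteration.
$scripts = iterator_to_array($dom->getElementsByTagName('script'));
foreach ($scripts as $scripttag) {
    $scripttag->parentNode->removeChild($scripttag);
}

// Collect the text of the remaining <ins> elements.
$string2 = '';
foreach ($dom->getElementsByTagName('ins') as $ins) {
    $string2 .= $ins->nodeValue . ' ';
}
echo trim($string2), "\n";
```

The only change from the loop above is the snapshot step; everything else follows the same pattern.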
EDIT 2 (SOLVED): With Steve's help, I found out that using XPath solves the problem:
$xpath = new DOMXPath($dom);
foreach ($xpath->query('//script') as $node) {
    $node->parentNode->removeChild($node);
}
This also removes script tags nested inside another tag. For example:
<ins><script>First JS</script></ins>
<ins>Hello</ins>
<script>Second JS</script>
will output:
Hello
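For reference, here is the XPath approach as a self-contained sketch. The helper name insTextWithoutScripts and the way the <ins> text is joined are my own assumptions for illustration, not part of the original code:

```php
<?php
// Hypothetical helper: removes every <script> element from an HTML string
// and returns the text of the remaining <ins> elements, space-separated.
function insTextWithoutScripts(string $html): string
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true); // tolerate malformed real-world HTML
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    // The XPath result is not a live list, so removing nodes inside the loop is safe.
    foreach ($xpath->query('//script') as $node) {
        $node->parentNode->removeChild($node);
    }

    $parts = [];
    foreach ($dom->getElementsByTagName('ins') as $ins) {
        $text = trim($ins->nodeValue);
        if ($text !== '') {
            $parts[] = $text; // skip <ins> elements left empty after removal
        }
    }
    return implode(' ', $parts);
}

$html = "<ins><script>First JS</script></ins>\n<ins>Hello</ins>\n<script>Second JS</script>";
echo insTextWithoutScripts($html), "\n";
```

Because the parser treats script content as raw text, a script wrapped in /* <![CDATA[ */ comments should be removed the same way as any other script element.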
Thank you all for the help!
"[...] I'm using the RegEx expression [...]" - I don't usually suggest using libraries, but you should look at the HTML Purifier library.