I need to remove all <script> tags, together with their content, from a PHP string fetched with file_get_contents from an arbitrary website URL. I'm using this regular expression:

preg_replace('/<script\b[^>]*>(.*?)<\/script>/is', "", $string);

It works fine, but my problem is that if a script contains a CDATA section, it doesn't work. An example string would be:

<script type='text/javascript'>
/* <![CDATA[ */
var variable = {"ajax":"....."}
/* ]]> */
</script>

I guess the problem is with the /* and */ markers around the CDATA tags.

I've already searched on Google and on Stack Overflow, but there is no question about that particular type of CDATA tag (with /* and */), so nothing I found works.

Any suggestions?

Edit: As Steve answered, I am now using code like this:

foreach($dom->getElementsByTagName('script') as $scripttag){
    $scripttag->parentNode->removeChild($scripttag);
}

And then I have:

foreach($dom->getElementsByTagName('ins') as $string) {
    $string2 .= $string->nodeValue;
    $string2 .= ' ';
}

But that returns a $string2 with script tags inside.

EDIT 2 (SOLVED): With Steve's help, I found out that using XPath solves the problem:

$xpath = new DOMXPath($dom);
foreach ($xpath->query('//script') as $node) {
    $node->parentNode->removeChild($node);
}

This also removes script tags nested inside another tag. For example:

<ins><script>First JS</script></ins>
<ins>Hello</ins>
<script>Second JS</script>

will output:

Hello
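
For anyone who wants to try this, here is a minimal self-contained sketch of the XPath approach applied to the example above. The variable names ($html, $parts, $result) and the use of libxml_use_internal_errors to silence parser warnings are my own assumptions for illustration, not part of the original code.

```php
<?php
// Sample input taken from the example above.
$html = "<ins><script>First JS</script></ins>\n<ins>Hello</ins>\n<script>Second JS</script>";

// Real-world HTML often triggers libxml warnings; collect them instead of printing them.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

// Remove every <script> element, wherever it is nested.
$xpath = new DOMXPath($dom);
foreach ($xpath->query('//script') as $node) {
    $node->parentNode->removeChild($node);
}

// Collect the text of each <ins>, skipping the ones left empty after the removal.
$parts = [];
foreach ($dom->getElementsByTagName('ins') as $ins) {
    $text = trim($ins->nodeValue);
    if ($text !== '') {
        $parts[] = $text;
    }
}
$result = implode(' ', $parts);

echo $result, PHP_EOL; // Hello
```

Skipping empty <ins> elements is a design choice made here so the result has no stray spaces; the loop from the first edit would produce the same text with extra whitespace around it.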

Thank you all for the help!

  • Regex for HTML parsing isn't a good idea. And don't forget to remove <img onload="hack();" />. Commented Nov 18, 2015 at 14:16
  • What is the problem? I see it works "nicely" (of course, with the provided example only). Commented Nov 18, 2015 at 14:16
  • @stribizhev That is the problem with parsing HTML with regex: from an attacker's point of view, I don't follow the rules... regex101.com/r/zV1yA2/1 Commented Nov 18, 2015 at 14:21
  • Hi, thanks for the advice, but as I answered to Steve, I'm also using a DOMDocument, but I don't know if it is possible to re-use it after deleting content... Commented Nov 18, 2015 at 14:31
  • Then please update your question ("[...] I'm using the RegEx expression [...]"). I don't normally suggest using libraries, but you should look at the HTML Purifier lib. Commented Nov 18, 2015 at 14:34
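
The concern raised in these comments can be checked with a short sketch. It applies the pattern from the question to two hand-made inputs (both are assumptions for illustration): the CDATA example from the question wrapped in the words "before" and "after", and a script whose closing tag is written as </script > with a space, which HTML permits. The variable names are hypothetical.

```php
<?php
// Pattern copied from the question.
$pattern = '/<script\b[^>]*>(.*?)<\/script>/is';

// Case 1: the CDATA example from the question, surrounded by marker text.
$cdata = "before<script type='text/javascript'>\n"
       . "/* <![CDATA[ */\n"
       . "var variable = {\"ajax\":\".....\"}\n"
       . "/* ]]> */\n"
       . "</script>after";
$afterCdata = preg_replace($pattern, '', $cdata);

// Case 2: whitespace before the closing ">" of the end tag.
$spaced = 'before<script>alert(1)</script >after';
$afterSpaced = preg_replace($pattern, '', $spaced);

var_dump($afterCdata === 'beforeafter'); // bool(true): the CDATA block is removed
var_dump($afterSpaced === $spaced);      // bool(true): the script is left untouched
```

So the CDATA comment markers are not what defeats the pattern; small variations in how the tags are written are. That is the reason for preferring a real parser.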

1 Answer

Don't use regex for this; use a proper HTML parser such as DOMDocument:

$dom = new DOMDocument('1.0', 'utf-8');
$dom->loadHTML($html);
// The node list is live: removing nodes while iterating forwards skips elements, so traverse backwards.
$elements = $dom->getElementsByTagName('script');
$count = $elements->length;
// Iterate down to and including index 0 so the first script is removed too.
while (--$count >= 0) {
    $script = $elements->item($count);
    $script->parentNode->removeChild($script);
}

//you can do further dom manipulation here if needed
$insertContents='';
foreach($dom->getElementsByTagName('ins') as $insert){
    $insertContents .= $insert->nodeValue . ' ';
}
//if you need the complete html at all:
$html = $dom->saveHTML();
//your desired string:
echo $insertContents;
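
Removing nodes while walking the list returned by getElementsByTagName is the delicate part of the code above. As an alternative sketch, the list can be copied into a plain array first, so that removals cannot affect the iteration. The helper name removeScripts and the sample markup are hypothetical and only serve the illustration.

```php
<?php
// Hypothetical helper: removes all <script> elements and returns how many were removed.
function removeScripts(DOMDocument $dom): int
{
    // Copy the live node list into a plain array before touching the tree.
    $scripts = iterator_to_array($dom->getElementsByTagName('script'));
    foreach ($scripts as $script) {
        $script->parentNode->removeChild($script);
    }
    return count($scripts);
}

// Usage with made-up markup containing three scripts, one of them nested.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML('<p>x</p><script>a</script><div><script>b</script></div><script>c</script>');
libxml_clear_errors();

$removed = removeScripts($doc);
$remaining = $doc->getElementsByTagName('script')->length;

echo "removed: $removed, remaining: $remaining", PHP_EOL;
```

The snapshot costs one extra array, but it avoids having to reason about the order of traversal.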

6 Comments

Hi, thanks for the answer, but I already use the DOMDocument for $dom->getElementsByTagName('ins'). Is it possible to use another function instead of $dom->saveHTML() to save the new DOM object and re-use it for the other getElements calls? (Sorry for my ignorance about DOM.)
You can just keep using the same instance: you can call getElementsByTagName as many times as you like, and only call saveHTML when you have finished processing.
Hi Steve, sorry for my insistence, but if I use that code like you said, it keeps the script tags in. In particular, what I do is foreach($dom->getElementsByTagName('ins') as $string) { $string2 .= $string->nodeValue; $string2 .= ' '; } But that returns a $string2 with script tags again... Sorry again for my ignorance.
@TekLitto Oops, sorry. My mistake. Turns out you need to be careful when removing elements whilst traversing them - see update
Thanks @Steve, now it works very well, but there is another little problem: if I have something like <ins><script>First JS</script></ins><script>Second JS</script>, the echo $insertContents; will output "First JS". Is it solvable?