I've been developing PHP applications for quite a while now, but this one really has me stumped. I'm loading complete HTML pages using DOMDocument. These pages are external and may contain JavaScript, which is beyond my control.

On some pages, basic HTML formatting inside JavaScript strings was not rendered the way it was supposed to be. I've written an example that demonstrates the problem.

<?php
$html = new DOMDocument();

libxml_use_internal_errors(true);

$strPage = '<html>
<head>
<title>Demo</title>
<script type="text/javascript">
var strJS = "<b>This is bold.</b><br /><br />This should not be bold. Where did my closing tag go to?";
</script>
</head>
<body>
<script type="text/javascript">
document.write(strJS);
</script>
</body>
</html>';

$html->loadHTML($strPage);
echo $html->saveHTML();
exit;
?>

Am I missing something?

Edit: I've changed the demo. Changing loadHTML to loadXML no longer works, and the output of the demo passes W3C validation. Adding a CDATA block around the JavaScript doesn't seem to have any effect either.

5 Comments
  • Are you missing something? Yes -> "Warning: DOMDocument::loadHTML(): Unexpected end tag : b in Entity..." So the problem is that loadHTML is eating tags inside of your script. Doesn't answer your question but perhaps alleviates a bit of the mystery. Commented Jul 4, 2014 at 13:39
  • Yes, thank you. That's exactly what this demo is about. Why is it eating the </b> tag? Commented Jul 4, 2014 at 17:11
  • I don't know why. You can avoid it by backslash-escaping the slashes in the closing tags contained in JavaScript strings, e.g. var strJS = "<b>This is bold.<\/b>... Commented Jul 4, 2014 at 17:34
  • Tested and you're right about that. The only problem then is that I normally don't have any control over the (external) HTML that is loaded into the DOM. Could it be a bug in the loadHTML implementation, or is there a hidden option that needs to be turned on to make this work? Commented Jul 4, 2014 at 17:43
  • @James: jibbering.com: "...an HTML parser is required to take the first [...] "</" [...] as marking the end of the script element.". HTML5: "...always escape "<!--" as "<\!--", "<script" as "<\script", and "</script" as "<\/script" [...] parsing of script blocks in HTML is a strange and exotic practice...". Commented Jul 4, 2014 at 19:37
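
The backslash-escaping workaround from the comments can also be applied to external HTML by pre-processing the page before it reaches loadHTML. What follows is only a minimal sketch: escapeClosingTagsInScripts is a hypothetical helper (not part of DOMDocument), and it assumes that every "</" inside a script body sits in a string or regex literal, where "<\/" means the same thing to JavaScript. A regex-based scan like this can misfire on unusual markup, so treat it as a starting point.

```php
<?php
// Hypothetical helper, not part of PHP or DOMDocument.
// Rewrites "</" as "<\/" inside the body of every <script> element so that
// an HTML parser does not mistake "</b>" and friends for the end of the script.
function escapeClosingTagsInScripts(string $html): string
{
    $result = preg_replace_callback(
        // 1: opening tag, 2: script body (lazy), 3: closing tag
        '~(<script\b[^>]*>)(.*?)(</script\s*>)~is',
        static function (array $m): string {
            // Only the body is touched; the real closing tag stays intact.
            return $m[1] . str_replace('</', '<\/', $m[2]) . $m[3];
        },
        $html
    );

    // preg_replace_callback() returns null on a regex error;
    // fall back to the unmodified input in that case.
    return $result ?? $html;
}
```

With the demo from the question, this would be used as $html->loadHTML(escapeClosingTagsInScripts($strPage)); markup outside of script elements is passed through unchanged.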

2 Answers

Adding LIBXML_SCHEMA_CREATE to the loadHTML() options fixes the issue:

<?php
$html = new DOMDocument();

libxml_use_internal_errors(true);

$strPage = '<html>
<head>
<title>Demo</title>
<script type="text/javascript">
var strJS = "<b>This is bold.</b><br /><br />This should not be bold. Where did my closing tag go to?";
</script>
</head>
<body>
<script type="text/javascript">
document.write(strJS);
</script>
</body>
</html>';

$html->loadHTML($strPage, LIBXML_HTML_NODEFDTD | LIBXML_SCHEMA_CREATE);
echo $html->saveHTML();
exit();


?>
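
Whether the option has the desired effect may depend on the PHP and libxml versions in use, so it can be worth checking on your own setup. The sketch below is one way to do that: firstScriptText is a hypothetical helper (not part of DOMDocument) that parses a page with the given options and returns the text of the first script element, so the result with and without the flags can be compared. No particular output is claimed here.

```php
<?php
// Hypothetical helper, not part of PHP or DOMDocument.
// Parses $page with the given libxml options and returns the text content
// of the first <script> element, or null if the page has no script element.
function firstScriptText(string $page, int $options = 0): ?string
{
    // Collect parser warnings internally instead of emitting them,
    // and restore the previous setting afterwards.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $doc->loadHTML($page, $options);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $script = $doc->getElementsByTagName('script')->item(0);

    return $script === null ? null : $script->textContent;
}
```

Calling firstScriptText($strPage) and firstScriptText($strPage, LIBXML_HTML_NODEFDTD | LIBXML_SCHEMA_CREATE) and comparing both strings with the original script body shows whether the closing tag survives in each case.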

I don't know why (I tried to find out), but it works if you load the HTML using loadXML instead of loadHTML:

$html = new DOMDocument();

libxml_use_internal_errors(true);

$strPage = "<html><head>";
$strPage .= "<script type=\"text/javascript\">";
$strPage .= "var strJS = \"<b>This is bold.</b><br /><br />This should not be bold. Where did my closing tag go to?\";";
$strPage .= "</script>";
$strPage .= "<body>";
$strPage .= "<script type=\"text/javascript\">";
$strPage .= "document.write(strJS);";
$strPage .= "</script>";
$strPage .= "</body>";
$strPage .= "</head></html>";

$html->loadXML($strPage);

echo $html->saveHTML();

Note that the HTML is actually invalid: everything is inside the head.

2 Comments

I've changed my example. Changing loadHTML to loadXML doesn't work anymore. Because the HTML was invalid, it more or less happened to pass as valid XML.
@Arjoes Your updated example code will work if you use loadXML instead of loadHTML. I know it's not ideal and it's counterintuitive, but I simply don't think DOMDocument sees <script> tags like HTML tags, nor will you be able to extract elements from the JS as if it had been executed. What do you actually want to do with the string? If you're not manipulating the HTML or extracting from it, and are just echoing it, DOMDocument is the wrong tool. Sorry I couldn't give a proper solution :(
