I have been trying to save the source code of a section of a webpage using PHP. When I extract the content of the whole webpage, the order of the source code is preserved, but when I try to get part of the document using

$dom = new DOMDocument;
$dom->loadHTML($webpage);
$xpath = new DOMXPath($dom);

$query_tag = "//div[contains(@class, 'class-name')]";
$result = $dom->saveHTML($xpath->query($query_tag)->item(0));

the script tag gets messed up. So far, this is the only website where the issue has occurred. Are there limitations of the saveHTML function that I am not aware of?

This is what I should be receiving:

<div id="sponsored-category-header" class="page-header sponsored-category-header clear"> <script type="text/javascript">jQuery(document).ready(function($) {
        var cat_head_params = {"sponsor":"SEO PowerSuite","sponsor_logo":"https:\/\/www.searchenginejournal.com\/wp-content\/plugins\/abm-sej\/includes\/category-images\/SPS_128.png","sponsor_text":"<div class=\"taxonomy-description\">Dominate Google local search results with ease! Get your copy of SEO PowerSuite and keep <a rel=\"nofollow\" href=\"http:\/\/sejr.nl\/PowerSuite-2016-5\" onClick=\"__gaTracker('send', 'event', 'Sponsored Category Click Var 1', 'Local Search', 'SEO PowerSuite');\" target=\"_blank\">your local SEO strategy<\/a> up to par.<\/div>","logo_url":"http:\/\/sejr.nl\/PowerSuite-2016-5","ga_labels":["Local Search","SEO PowerSuite"]}            
        $('#sponsored-category-header').append('<div class="sponsored-category-logo"></div>');
                     $('#sponsored-category-header .sponsored-category-logo').append(' <a rel="nofollow" href="'+cat_head_params.logo_url+'" onClick="__gaTracker(\'send\', \'event\', \'Sponsored Category Click Var 1\', \''+cat_head_params.ga_labels[0]+'\', \''+cat_head_params.ga_labels[0]+'\');" target="_blank"><img class="nopin" src="'+cat_head_params.sponsor_logo+'" width="96" height="96" /></a>');
                                   $('#sponsored-category-header').append('<div class="sponsored-category-details"></div>');
         $('#sponsored-category-header .sponsored-category-details').append('<h3 class="page-title sponsored-category-title">'+cat_head_params.sponsor+'</h3>');
         $('#sponsored-category-header .sponsored-category-details').append(cat_head_params.sponsor_text);


});</script> </div>

This is what I actually get:

<div id="sponsored-category-header" class="page-header sponsored-category-header clear"> <script type="text/javascript">jQuery(document).ready(function($) {
        var cat_head_params = {"sponsor":"SEO PowerSuite","sponsor_logo":"https:\/\/www.searchenginejournal.com\/wp-content\/plugins\/abm-sej\/includes\/category-images\/SPS_128.png","sponsor_text":"<div class=\"taxonomy-description\">Dominate Google local search results with ease! Get your copy of SEO PowerSuite and keep <a rel=\"nofollow\" href=\"http:\/\/sejr.nl\/PowerSuite-2016-5\" onClick=\"__gaTracker('send', 'event', 'Sponsored Category Click Var 1', 'Local Search', 'SEO PowerSuite');\" target=\"_blank\">your local SEO strategy<\/a> up to par.<\/div>","logo_url":"http:\/\/sejr.nl\/PowerSuite-2016-5","ga_labels":["Local Search","SEO PowerSuite"]}            
        $('#sponsored-category-header').append('<div class="sponsored-category-logo"></script>


</div>');
                     $('#sponsored-category-header .sponsored-category-logo').append(' <a rel="nofollow" href="'+cat_head_params.logo_url+'" onclick="__gaTracker(\'send\', \'event\', \'Sponsored Category Click Var 1\', \''+cat_head_params.ga_labels[0]+'\', \''+cat_head_params.ga_labels[0]+'\');" target="_blank"><img class="nopin" src="'+cat_head_params.sponsor_logo+'" width="96" height="96"></a>');
                                   $('#sponsored-category-header').append('<div class="sponsored-category-details"></div>');
         $('#sponsored-category-header .sponsored-category-details').append('<h3 class="page-title sponsored-category-title">'+cat_head_params.sponsor+'</h3>');
         $('#sponsored-category-header .sponsored-category-details').append(cat_head_params.sponsor_text);


    }); </div>

In case you missed it, the ending script tag has moved up a few lines.

Just to be clear, I am not talking about rendered HTML. I am talking about the actual source code that I get after making the request. Any help on how to resolve this issue will be appreciated.

I know that the function saveHTML is causing the issue because when I echo the whole page through PHP, every tag is in the right place.

  • DOMDocument is a proper HTML parser so it cannot handle the invalid tag soup you often find in the wild. Just like your browser, it'll fix the HTML the best it can. Commented Oct 5, 2016 at 18:21
  • @ÁlvaroGonzález So, the source code gets messed up after $dom->loadHTML($webpage);? Commented Oct 5, 2016 at 18:23
  • Correct. I haven't had the chance to inspect the site but, if there's invalid markup (I'm not saying whether this is the case here or not, thus I'm leaving a comment rather than an answer), it gets fixed right then because PHP needs to operate on an in-memory representation of the document tree (as I said, that's what any browser does). The source code is only that, a source. Commented Oct 5, 2016 at 18:32
  • I don't see anything invalid on the page in the section you're interested in, but I'd be hesitant to blame saveHTML specifically without first breaking down the process and inspecting each step to eliminate loadHTML, DOMXPath, and query first. Commented Oct 5, 2016 at 18:40
  • Thanks @LinuxDisciple How can the query mess up? This is the query that I used //div[contains(@class, 'post-data')]. It gets the first result which contains other markup besides the script tag. Is there some way to check if either loadHTML or DOMXPath is the culprit here? Commented Oct 5, 2016 at 18:51

1 Answer

First of all, your code should be triggering quite a few warnings like these:

Warning: DOMDocument::loadHTML(): htmlParseEntityRef: expecting ';' in Entity
Warning: DOMDocument::loadHTML(): Unexpected end tag : strong in Entity
Warning: DOMDocument::loadHTML(): Tag header invalid in Entity

This is to be expected with HTML found in the wild (and this page's code is not particularly bad), but you haven't even mentioned it, which makes me suspect that you might not have error reporting enabled on your development box.
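
To see these messages without relying on the global error settings, the libxml errors can be collected explicitly. The following is a minimal sketch, assuming the page source is already available as a string; `collectParseErrors` is a hypothetical helper name, not part of the DOM extension, and the exact messages depend on the libxml2 version PHP was built against.

```php
<?php
// Hypothetical helper: parses $html and returns the messages libxml reported.
function collectParseErrors(string $html): array
{
    // Buffer libxml errors instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    libxml_clear_errors();

    $dom = new DOMDocument;
    $dom->loadHTML($html);

    $messages = [];
    foreach (libxml_get_errors() as $error) {
        $messages[] = sprintf('line %d: %s', $error->line, trim($error->message));
    }

    // Restore the previous error-handling mode.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $messages;
}

// Illustrative input only; replace it with the real page source.
print_r(collectParseErrors('<div><header>Title</strong></div>'));
```

An empty array means the parser had nothing to complain about; any entries are the problems loadHTML() repaired while building the tree.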

Additionally, the page has huge amounts of JavaScript and DOMDocument is just an HTML parser.

With that, we can get a clear picture of what's happening. Since DOMDocument is not a full-fledged browser, it doesn't understand JavaScript code. That means that it detects the <script> tag but doesn't handle its contents as JavaScript; it merely looks for a closing tag, and the first one it finds is this:

$('#sponsored-category-header').append('<div class="sponsored-category-logo"></div>');
                                                                             ^^^^^^

It doesn't know that it's a JavaScript string and should be ignored. Instead, it thinks the wrong tag is being closed so it attempts to fix what's technically invalid HTML and adds the missing </script> tag.
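
One way to check that the markup changes while loading, rather than in saveHTML() or the XPath query, is to look at the script node right after parsing. This is only a sketch: `scriptBodiesAfterParse` is a hypothetical helper name and the sample markup is made up. How much of the script survives depends on the libxml2 version in use, so no expected output is shown.

```php
<?php
// Hypothetical helper: returns the text of every <script> element as the
// parser stored it, before any XPath query or saveHTML() call is involved.
function scriptBodiesAfterParse(string $html): array
{
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument;
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $bodies = [];
    foreach ($dom->getElementsByTagName('script') as $script) {
        $bodies[] = $script->textContent;
    }
    return $bodies;
}

// Made-up sample with a closing tag inside a JavaScript string.
$sample = '<div><script>var s = \'<div class="logo"></div>\'; done();</script></div>';
var_dump(scriptBodiesAfterParse($sample));
```

If the dumped text stops before `</div>`, the script was already cut short in the document tree, and saveHTML() is merely serializing what the parser built.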

For this precise reason, the <script>...</script> tag set has traditionally been written this way:

<script type="text/javascript"><!--
var foo = '<p>Escaped end tag<\/p>';
//--></script>

... so user agents that are unaware of JavaScript can safely ignore the whole tag (hey, it's nothing but a good old HTML comment). However, nowadays it's almost universally considered bad practice because "all browsers understand JavaScript".

Final note: the DOM extension is probably aware of the <script> tag and knows it isn't allowed to have other tags inside. That explains why inner opening tags are not considered.
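
If the page itself cannot be changed, one possible workaround is to apply the same `<\/` escaping to the source before handing it to loadHTML(). The sketch below is an illustration under assumptions, not a vetted fix. `escapeScriptEndTags` is a hypothetical helper name, and the regular expression assumes every script block ends with a literal </script>. It rewrites every "</" inside a script, including ones outside string literals, which could change the meaning of unusual scripts.

```php
<?php
// Hypothetical helper (not part of the DOM extension): rewrites "</" as "<\/"
// inside every <script> element so the parser cannot mistake a JavaScript
// string for a closing tag.
function escapeScriptEndTags(string $html): string
{
    $escaped = preg_replace_callback(
        '#(<script\b[^>]*>)(.*?)(</script\s*>)#is',
        function (array $m): string {
            // $m[1] = opening tag, $m[2] = script body, $m[3] = closing tag
            return $m[1] . str_replace('</', '<\/', $m[2]) . $m[3];
        },
        $html
    );

    // preg_replace_callback() returns null on failure; fall back to the input.
    return $escaped ?? $html;
}

// Usage with made-up markup shaped like the snippet in the question.
$webpage = '<div class="class-name"><script>$("#x").append(\'<div class="logo"></div>\');</script></div>';

libxml_use_internal_errors(true);
$dom = new DOMDocument;
$dom->loadHTML(escapeScriptEndTags($webpage));
$xpath = new DOMXPath($dom);

echo $dom->saveHTML($xpath->query("//div[contains(@class, 'class-name')]")->item(0));
```

Note that the serialized script would then contain `<\/div>` instead of `</div>`. Inside a JavaScript string the two are equivalent, but the output is no longer byte-for-byte identical to the original source.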

1 Comment

Thanks Alvaro :). That explains it!