I have been trying to save the source code of a section of a webpage using PHP. When I extract the content of the whole webpage, the order of the source code is preserved, but when I try to get part of the document using
$dom = new DOMDocument;
$dom->loadHTML($webpage);
$xpath = new DOMXPath($dom);
$query_tag = "//div[contains(@class, 'class-name')]";
$result = $dom->saveHTML($xpath->query($query_tag)->item(0));
The script tag gets messed up. So far, this is the only website where this issue has occurred. Are there limitations of the saveHTML function that I am not aware of?
This is what I should be receiving:
<div id="sponsored-category-header" class="page-header sponsored-category-header clear"> <script type="text/javascript">jQuery(document).ready(function($) {
var cat_head_params = {"sponsor":"SEO PowerSuite","sponsor_logo":"https:\/\/www.searchenginejournal.com\/wp-content\/plugins\/abm-sej\/includes\/category-images\/SPS_128.png","sponsor_text":"<div class=\"taxonomy-description\">Dominate Google local search results with ease! Get your copy of SEO PowerSuite and keep <a rel=\"nofollow\" href=\"http:\/\/sejr.nl\/PowerSuite-2016-5\" onClick=\"__gaTracker('send', 'event', 'Sponsored Category Click Var 1', 'Local Search', 'SEO PowerSuite');\" target=\"_blank\">your local SEO strategy<\/a> up to par.<\/div>","logo_url":"http:\/\/sejr.nl\/PowerSuite-2016-5","ga_labels":["Local Search","SEO PowerSuite"]}
$('#sponsored-category-header').append('<div class="sponsored-category-logo"></div>');
$('#sponsored-category-header .sponsored-category-logo').append(' <a rel="nofollow" href="'+cat_head_params.logo_url+'" onClick="__gaTracker(\'send\', \'event\', \'Sponsored Category Click Var 1\', \''+cat_head_params.ga_labels[0]+'\', \''+cat_head_params.ga_labels[0]+'\');" target="_blank"><img class="nopin" src="'+cat_head_params.sponsor_logo+'" width="96" height="96" /></a>');
$('#sponsored-category-header').append('<div class="sponsored-category-details"></div>');
$('#sponsored-category-header .sponsored-category-details').append('<h3 class="page-title sponsored-category-title">'+cat_head_params.sponsor+'</h3>');
$('#sponsored-category-header .sponsored-category-details').append(cat_head_params.sponsor_text);
});</script> </div>
This is what I actually get:
<div id="sponsored-category-header" class="page-header sponsored-category-header clear"> <script type="text/javascript">jQuery(document).ready(function($) {
var cat_head_params = {"sponsor":"SEO PowerSuite","sponsor_logo":"https:\/\/www.searchenginejournal.com\/wp-content\/plugins\/abm-sej\/includes\/category-images\/SPS_128.png","sponsor_text":"<div class=\"taxonomy-description\">Dominate Google local search results with ease! Get your copy of SEO PowerSuite and keep <a rel=\"nofollow\" href=\"http:\/\/sejr.nl\/PowerSuite-2016-5\" onClick=\"__gaTracker('send', 'event', 'Sponsored Category Click Var 1', 'Local Search', 'SEO PowerSuite');\" target=\"_blank\">your local SEO strategy<\/a> up to par.<\/div>","logo_url":"http:\/\/sejr.nl\/PowerSuite-2016-5","ga_labels":["Local Search","SEO PowerSuite"]}
$('#sponsored-category-header').append('<div class="sponsored-category-logo"></script>
</div>');
$('#sponsored-category-header .sponsored-category-logo').append(' <a rel="nofollow" href="'+cat_head_params.logo_url+'" onclick="__gaTracker(\'send\', \'event\', \'Sponsored Category Click Var 1\', \''+cat_head_params.ga_labels[0]+'\', \''+cat_head_params.ga_labels[0]+'\');" target="_blank"><img class="nopin" src="'+cat_head_params.sponsor_logo+'" width="96" height="96"></a>');
$('#sponsored-category-header').append('<div class="sponsored-category-details"></div>');
$('#sponsored-category-header .sponsored-category-details').append('<h3 class="page-title sponsored-category-title">'+cat_head_params.sponsor+'</h3>');
$('#sponsored-category-header .sponsored-category-details').append(cat_head_params.sponsor_text);
}); </div>
In case you missed it, the ending script tag has moved up a few lines.
Just to be clear, I am not talking about rendered HTML. I am talking about the actual source code that I get after making the request. Any help on how to resolve this issue will be appreciated.
I know that the function saveHTML is causing the issue because when I echo the whole page through PHP, every tag is in the right place.
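One way to narrow this down, offered as a sketch rather than a definitive diagnosis: read the script text straight out of the parsed tree, with no XPath query and no per-node saveHTML call. If the text is already cut short at this point, the closing tag was moved while loadHTML was parsing, not while saveHTML was serializing. The helper name script_bodies is made up for illustration.

```php
<?php
// Hypothetical helper: returns the text content of every <script> element
// exactly as DOMDocument holds it right after parsing.
function script_bodies(string $html): array
{
    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument;
    $dom->loadHTML($html);

    // Discard the collected warnings and restore the previous setting.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $bodies = [];
    foreach ($dom->getElementsByTagName('script') as $script) {
        $bodies[] = $script->textContent;
    }
    return $bodies;
}
```

Comparing each entry of script_bodies($webpage) with the original script source should show whether the parser already ended the script early.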
DOMDocument is a proper HTML parser so it cannot handle the invalid tag soup you often find in the wild. Just like your browser, it'll fix the HTML the best it can.
$dom->loadHTML($webpage);?
loadHTML, DOMXPath, and query first.
//div[contains(@class, 'post-data')]. It gets the first result which contains other markup besides the script tag. Is there some way to check if either loadHTML or DOMXPath is the culprit here?
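As for checking whether loadHTML or DOMXPath is the culprit, one possible sketch is to collect the libxml messages produced during loadHTML and to serialize the whole document before any XPath query runs. DOMXPath only selects nodes from the tree that already exists, so if the full-document output already shows the misplaced tag, the query is unlikely to be what moved it. The function name inspect_parse and the keys of the returned array are assumptions for illustration.

```php
<?php
// Hypothetical helper: parses $html, records what libxml complained about,
// and serializes both the whole document and the first node matching $query.
function inspect_parse(string $html, string $query): array
{
    // Collect parser messages instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument;
    $dom->loadHTML($html);

    $messages = [];
    foreach (libxml_get_errors() as $error) {
        $messages[] = sprintf('line %d: %s', $error->line, trim($error->message));
    }
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Serialize the full tree first, before DOMXPath is involved at all.
    $whole = $dom->saveHTML();

    // Then serialize only the first matching node, as in the question.
    $xpath = new DOMXPath($dom);
    $nodes = $xpath->query($query);
    $node = ($nodes !== false) ? $nodes->item(0) : null;
    $fragment = ($node !== null) ? $dom->saveHTML($node) : null;

    return ['messages' => $messages, 'whole' => $whole, 'fragment' => $fragment];
}
```

If 'whole' and 'fragment' show the same misplaced closing tag, the change happened at parse time; the entries in 'messages', if any, may point at the line where the parser started repairing the markup.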