In my code I am trying to fetch the entire HTML markup of the pages on my old website while ignoring all JavaScript (the AdSense code). I have about 800 pages, and it is hard for me to copy them one by one. The main problem I am facing is that my XPath is too long and gives me an error every time; secondly, it only prints the text instead of the HTML code. I don't know how to resolve it.

My XPath

/html/body/div/div/div/div[4]/table/tbody/tr/td/div/h2/table/tbody/tr/td/div[1]/table/tbody/tr/td[1]/div/table/tbody/tr/td/div/table/tbody/tr/td/div/table/tbody/tr/td/div

Errors I am getting are available at https://pastebin.com/FFRLr3vq

My Current PHP Code

error_reporting(E_ERROR);
$urls[] = "http://myoldwebsite.com/somepage.html";

function curlload($url) {
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL,$url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
        curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, 0);
        curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 1);
        $source = curl_exec($ch);
        return $source;
}

foreach ($urls as $url) {
    $source = curlLoad($url);
    @$doc = new DOMDocument();
    @$doc->loadHTML($source);

    $xpath = new DomXPath($doc);
    $nodeList = $xpath->query("//div[@class='pageContent']");

    // To check the result (nodeValue holds only the text of each node):
    foreach ($nodeList as $node) {
        echo "<p>" . $node->nodeValue . "</p>";
    }
}
  • Does the table have any attributes you can attach onto? Can you please post the table source? That would help me help you better. Commented Sep 11, 2017 at 15:25
  • @IamBatman can you please review my updated PHP code? Commented Sep 11, 2017 at 15:27
  • @Rtra off-topic: You should rename your function to curlLoad or call it as curlload, but don't mix the case. You should also not use @ to suppress errors. That is bad practice. Commented Sep 11, 2017 at 15:29
  • @Rtra on-topic: The errors simply tell you that you are trying to load invalid HTML markup, meaning the error is not in this code but in the $source file. Commented Sep 11, 2017 at 15:30
  • @Xatenev thanks for your suggestion, I renamed the function. Commented Sep 11, 2017 at 15:31

1 Answer

To output the loaded HTML instead of only its text, you can use DOMDocument::saveHTML():

http://php.net/manual/de/domdocument.savehtml.php
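
Since nodeValue returns only the text content of a node, passing the matched node to saveHTML() is one way to get its markup instead. Here is a minimal sketch, assuming the content sits in a div with class pageContent as in the question's query; the sample markup is made up for illustration and stands in for a fetched page:

```php
<?php
// Hypothetical sample markup standing in for one fetched page.
$html = '<html><body><div class="pageContent"><p>Hello <b>world</b></p></div></body></html>';

$doc = new DOMDocument();
$doc->loadHTML($html);

$xpath = new DOMXPath($doc);
$nodeList = $xpath->query("//div[@class='pageContent']");

$markup = '';
foreach ($nodeList as $node) {
    // $node->nodeValue would give only the text "Hello world";
    // saveHTML($node) serializes the element together with its tags.
    $markup .= $doc->saveHTML($node);
}

echo $markup;
```

A short class-based query like this also avoids depending on the exact nesting of tables that the long absolute XPath encodes.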

To remove script tags (as discussed in the chat), you can use something like this:

<?php

$html = <<<HTML
...
HTML;

$dom = new DOMDocument();

$dom->loadHTML($html);

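$script = $dom->getElementsByTagName('script');

// The node list is live: removing nodes while iterating over it would
// skip elements, so collect them in an array first.
$remove = [];
foreach($script as $item)
{
  $remove[] = $item;
}

// Detach every collected <script> element from its parent.
foreach ($remove as $item)
{
  $item->parentNode->removeChild($item); 
}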

$html = $dom->saveHTML();

Source & more info: remove script tag from HTML content
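
Putting the two steps together for the question's use case could look roughly like the sketch below. The function name extractContent, its parameters, and the sample input are assumptions made for illustration, not part of the original code. It uses libxml_use_internal_errors() to keep parse warnings from invalid markup out of the output, rather than silencing them with @:

```php
<?php
/**
 * Hypothetical helper: parse a page, drop every <script> element,
 * and return the markup of the nodes matched by $query.
 */
function extractContent(string $source, string $query): string
{
    // Buffer libxml parse warnings internally instead of using @.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $doc->loadHTML($source);

    // Discard the buffered warnings and restore the previous setting.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Copy the live node list into an array before removing anything.
    foreach (iterator_to_array($doc->getElementsByTagName('script')) as $script) {
        $script->parentNode->removeChild($script);
    }

    // Serialize each matched node with its tags, not just its text.
    $xpath = new DOMXPath($doc);
    $out = '';
    foreach ($xpath->query($query) as $node) {
        $out .= $doc->saveHTML($node);
    }
    return $out;
}

// Made-up input: an ad script inside deliberately sloppy markup.
$source = '<div class="pageContent"><p>Text<script>adsbygoogle.push({});</script></div>';
$result = extractContent($source, "//div[@class='pageContent']");
echo $result;
```

In the question's loop, the return value of curlload($url) would take the place of $source, and the result could be written to a file per page instead of echoed.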
