I am trying to scrape a Chinese website using PHP and cURL. Earlier I had an issue with compressed results, which SO helped me sort out. Now I'm having trouble parsing the content with PHP's DOMDocument. The error is as follows:
Warning: DOMDocument::loadHTML(): input conversion failed due to input error, bytes 0xE3 0x80 0x90 0xE8 in /var/www/html/ ..
Even though it is only a warning, it prevents me from getting any further results.
My code is given below:
$agent = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:25.0) Gecko/20100101 Firefox/25.0';
$curl = curl_init();
curl_setopt($curl, CURLOPT_URL,$url);
curl_setopt($curl, CURLOPT_HTTPHEADER, array('text/html; charset=gb2312'));
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($curl, CURLOPT_CONNECTTIMEOUT, 10);
curl_setopt($curl, CURLOPT_ENCODING, ""); // handling all compressions
curl_setopt($curl, CURLOPT_USERAGENT, $agent);
curl_setopt($curl, CURLOPT_TIMEOUT, 1000);
$html = curl_exec($curl) or die("error: ".curl_error($curl));
curl_close($curl);
$htmlParsed = mb_convert_encoding($html, 'utf-8', 'gb2312');
$doc = new DOMDocument();
$doc->loadHTML($htmlParsed);
$xpath = new DOMXpath($doc);
$elements = $xpath->query('//div[@class="test"]//a/@href');
if (!is_null($elements)) {
    foreach ($elements as $element) {
        echo "<br/>[". $element->nodeName. "]";
        $nodes = $element->childNodes;
        foreach ($nodes as $node) {
            echo $node->nodeValue. "\n";
        }
    }
}
The target website declares its content type as:
<meta http-equiv="Content-Type" content="text/html; charset=gb2312" />
So I tried converting the result to UTF-8.
Since the input conversion fails at the DOMDocument::loadHTML() line, I can't parse the web page to get the results. I am currently stuck at this point, and any help or suggestions would be highly appreciated. Thanks in advance.
(Earlier I used Simple HTML DOM Parser, which was pretty simple. But after reading on SO about the drawbacks of using it, I decided to switch to PHP's native DOM parser.)
Use @$doc->loadHTML($htmlParsed); this is maybe the only time when suppressing errors is acceptable, because the PHP DOM is very, very pernickety. Also, try not to convert the page; just load it as it is, then eliminate the next issue (if any). For the XPath query, start by getting something very simple, then move on to the next element.
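To make the "load it as it is" suggestion concrete, here is a minimal sketch rather than a definitive fix. The bytes in the warning (0xE3 0x80 0x90) are the UTF-8 encoding of 【, which suggests the page had already been converted to UTF-8 while its meta tag still declared gb2312. The sample page, the link path, and the $hrefs variable below are hypothetical stand-ins for the real cURL response, and the sketch assumes the libxml2 build behind PHP can decode GB2312 itself.

```php
<?php
// Hypothetical stand-in for the cURL response: a small page that is really
// GB2312-encoded and says so in its meta tag, like the target site.
$utf8Sample = '<html><head>'
    . '<meta http-equiv="Content-Type" content="text/html; charset=gb2312" />'
    . '</head><body><div class="test"><a href="/news/1.html">【新闻】</a></div>'
    . '</body></html>';
$html = mb_convert_encoding($utf8Sample, 'GB2312', 'UTF-8');

// Collect parser warnings internally instead of printing them.
$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument();
// Pass the raw, unconverted bytes. Assumption: libxml picks up the charset
// from the meta tag and converts to UTF-8 on its own.
$doc->loadHTML($html);

libxml_clear_errors();
libxml_use_internal_errors($previous);

// Start with a very simple query, then narrow it down.
$xpath = new DOMXPath($doc);
$hrefs = [];
foreach ($xpath->query('//div[@class="test"]//a') as $a) {
    $hrefs[] = $a->getAttribute('href');
    echo $a->getAttribute('href'), ' => ', $a->textContent, "\n";
}
```

Using libxml_use_internal_errors() instead of @ keeps the warnings retrievable through libxml_get_errors(), which helps when moving on to the next issue.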