4

I am trying to scrape a Chinese website using PHP and cURL. Earlier I had an issue with compressed results, and SO helped me sort it out. Now I'm having trouble parsing the contents with PHP's DOMDocument. The error is as follows:

Warning: DOMDocument::loadHTML(): input conversion failed due to input error, bytes 0xE3 0x80 0x90 0xE8 in /var/www/html/ ..

Even though it is only a warning, it prevents me from getting any further results.

My code is as given below:

$agent = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:25.0) Gecko/20100101 Firefox/25.0';
$curl = curl_init(); 
curl_setopt($curl, CURLOPT_URL,$url); 
curl_setopt($curl, CURLOPT_HTTPHEADER, array('text/html; charset=gb2312')); 
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);  
curl_setopt($curl, CURLOPT_CONNECTTIMEOUT, 10);
curl_setopt($curl, CURLOPT_ENCODING, "");  // handling all compressions 
curl_setopt($curl, CURLOPT_USERAGENT, $agent);
curl_setopt($curl, CURLOPT_TIMEOUT, 1000);
$html = curl_exec($curl) or die("error: ".curl_error($curl));
curl_close($curl);
$htmlParsed = mb_convert_encoding($html, 'utf-8', 'gb2312'); // convert the fetched page, which is held in $html

$doc = new DOMDocument();
$doc->loadHTML($htmlParsed);

$xpath = new DOMXpath($doc);

$elements = $xpath->query('//div[@class="test"]//a/@href');

if (!is_null($elements)) {
  foreach ($elements as $element) {
    echo "<br/>[". $element->nodeName. "]";

    $nodes = $element->childNodes;
    foreach ($nodes as $node) {
      echo $node->nodeValue. "\n";
    }
  }
}

I found that the target website declares its content type as:

<meta http-equiv="Content-Type" content="text/html; charset=gb2312" />

So I tried converting the result to UTF-8.

Since the input conversion fails at the DOMDocument::loadHTML() line of the code, I can't parse the web page to get the results. I am currently stuck at this point, and any help or suggestions will be highly appreciated. Thanks in advance.

(Earlier I used to work with Simple HTML DOM Parser, which was pretty simple. But after reading about its drawbacks on SO, I decided to switch to PHP's native DOM parser.)

7
  • Try suppressing errors while loading the HTML, i.e. @$doc->loadHTML($htmlParsed);. This may be the only time when suppressing errors is acceptable, because the PHP DOM is very, very pernickety. Also try not to convert the page: just load it as it is, then eliminate the next issue (if any). Commented Apr 29, 2014 at 9:20
  • Yes, I had tried suppressing the errors, but it did not yield any results. Commented Apr 29, 2014 at 9:24
  • Check your XPath query as well: try to get something very simple first, then move on to the next element. Commented Apr 29, 2014 at 9:27
  • @bodi0 Yes, I tried with some very simple tags, but no luck! :( Commented Apr 29, 2014 at 9:38
  • Read about this PHP bug (bugs.php.net/bug.php?id=47108&edit=3). Which version of PHP do you use? You could also try PHP tidy (php.net/manual/en/intro.tidy.php). Commented Apr 29, 2014 at 10:23
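
As an alternative to the @ operator suggested in the comments above, libxml's warnings can be buffered and inspected instead of silenced. The following is only a minimal sketch: the sample markup (including the deliberately invalid <bogus> tag) is made up for illustration, and it assumes the dom and libxml extensions are enabled.

```php
<?php
// Sketch: collect libxml parse warnings instead of hiding them with "@".
// The sample markup below is invented for illustration only.
$sample = '<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
        . '<title>demo</title></head>'
        . '<body><div class="test"><a href="/a">x</a><bogus></div></body></html>';

$previous = libxml_use_internal_errors(true); // buffer warnings instead of emitting them
$doc = new DOMDocument();
$loaded = $doc->loadHTML($sample);
$errors = libxml_get_errors();                // array of LibXMLError objects
libxml_clear_errors();
libxml_use_internal_errors($previous);        // restore the earlier setting

// Report whatever the parser complained about, if anything.
foreach ($errors as $error) {
    echo trim($error->message), ' (line ', $error->line, ")\n";
}

// The document is still usable even if warnings were recorded.
$xpath = new DOMXPath($doc);
$hrefs = [];
foreach ($xpath->query('//div[@class="test"]//a/@href') as $attr) {
    $hrefs[] = $attr->nodeValue;
}
print_r($hrefs);
```

Unlike @, this keeps the messages available, so it is easier to tell a harmless markup complaint from a real encoding failure.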

3 Answers

3

I found a solution today:

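$html = new DOMDocument();
// get_html() is not defined in this answer; it presumably returns the page source as a UTF-8 string
$html_source = get_html();
// Convert non-ASCII characters to HTML entities so that loadHTML() receives plain ASCII
$html_source = mb_convert_encoding($html_source, "HTML-ENTITIES", "UTF-8");
$html->loadHTML($html_source);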
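
The "HTML-ENTITIES" pseudo-encoding used above is deprecated as of PHP 8.2, so a similar effect can be sketched with mb_encode_numericentity(). This is only an illustration of the same idea, not the answer's original code: the sample markup and variable names are made up, and the script file is assumed to be saved as UTF-8.

```php
<?php
// Sketch: make non-ASCII characters safe for DOMDocument::loadHTML() by
// turning them into numeric entities. The sample markup is invented.
$html_source = '<html><head><title>【测试】</title></head>'
             . '<body><div class="test"><a href="/news/1">新闻</a></div></body></html>';

// Encode every code point from U+0080 upward; ASCII markup is left untouched.
$convmap = [0x80, 0x10FFFF, 0, 0x1FFFFF];
$entities = mb_encode_numericentity($html_source, $convmap, 'UTF-8');

$doc = new DOMDocument();
$doc->loadHTML($entities);

// DOM strings come back as UTF-8, so the original characters are restored.
$title = $doc->getElementsByTagName('title')->item(0)->textContent;
echo $title, "\n";
```

Because the parser only ever sees ASCII, its guess about the document's character set no longer matters.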

2

Without seeing the full head of the document you are parsing I can only guess, but if the <meta> with the character encoding data does not come directly after the <head> tag, DomDocument may be falling back to its default of ISO-8859-1 and then hitting the 【 character. The first three "invalid" bytes in the warning, 0xE3 0x80 0x90, are that character's UTF-8 encoding, and the 0x80 byte would be the first bit of nonsense, since it does not map to a printable character in ISO-8859-1. This would likely trigger the bug in DomDocument discussed in the comments above, and it could easily happen if the <title> element is included before the content-type meta information.

The only thing I can think of to try is to run the HTML through a bit of prep and move that content-type meta tag to right after the <head> tag, so that the correct character set is used. If you use mb_convert_encoding or iconv to convert the encoding to ISO-8859-1 or UTF-8, make sure that you also modify the meta information, because DomDocument is, unfortunately, brittle in many ways.
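
The prep step described above could be sketched roughly as follows. normalize_charset_meta() is a hypothetical helper name, the sample page is made up, and rewriting HTML with regular expressions is a simplification that may not cover every real-world page.

```php
<?php
// Sketch: drop any existing charset <meta> and put a fresh one directly
// after <head>. normalize_charset_meta() is a hypothetical helper.
function normalize_charset_meta(string $html, string $charset = 'utf-8'): string
{
    // Remove existing charset declarations (http-equiv and HTML5 forms).
    $html = preg_replace('/<meta[^>]+charset\s*=[^>]*>/i', '', $html);
    $meta = '<meta http-equiv="Content-Type" content="text/html; charset=' . $charset . '">';
    // Insert right after the opening <head ...> tag, once only.
    $result = preg_replace('/<head\b[^>]*>/i', '$0' . $meta, $html, 1, $count);
    // No <head> found: prepend the declaration so the parser still sees it first.
    return $count > 0 ? $result : $meta . $html;
}

// Made-up page that was already converted to UTF-8 but still declares
// gb2312, and does so only after the <title>.
$page = '<html><head><title>【测试】</title>'
      . '<meta http-equiv="Content-Type" content="text/html; charset=gb2312" />'
      . '</head><body><div class="test"><a href="/news/1">新闻</a></div></body></html>';

$fixed = normalize_charset_meta($page, 'utf-8');

libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc = new DOMDocument();
$doc->loadHTML($fixed);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
foreach ($xpath->query('//div[@class="test"]//a') as $a) {
    echo $a->getAttribute('href'), ' ', $a->textContent, "\n";
}
```

The declared charset has to match the actual bytes, so the rewrite belongs after any mb_convert_encoding or iconv call, not before it.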

2
<?php
$contents = file_get_contents('xml.xml');
function convert_utf8( $string ) { 
    if ( strlen(utf8_decode($string)) == strlen($string) ) {   
        // $string is not UTF-8
        return iconv("ISO-8859-1", "UTF-8", $string);
    } else {
        // already UTF-8
        return $string;
    }
}

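// mb_convert_encoding() takes (string, to_encoding, from_encoding), so the target "UTF-8" comes first
$contents = mb_convert_encoding($contents, "UTF-8", mb_detect_encoding($contents, ['UTF-8', 'ISO-8859-1'], true));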

$xml = simplexml_load_string(convert_utf8($contents));
print_r($xml);
