
I am fetching HTML with cURL from two websites.

SITE 1: https://xperia.sony.jp/campaign/360RA/?s_tc=somc_co_ext_docomo_360RA_banner

SITE 2: https://www.fidelity.jp/fwe-top/?utm_source=outbrain&utm_medium=display&utm_campaign=similar-gdw&utm_content=FS001&dicbo=v1-b6eb7c5f86a6978bba74e3703a046886-00d8ad90c4cb65b2bdcc239bcccf5ec378-mnrtcytfgu4toljwgjrwgljumu4wmljzg5tgkljxgzsdgzbqmyzwenbsgy

My cURL looks like:

    $ua = "Mozilla/5.0 (X11; Linux i686; rv:36.0) Gecko/20100101 Firefox/36.0 SeaMonkey/2.33.1";
    $options = array(
        CURLOPT_RETURNTRANSFER => true, // return the web page as a string
        CURLOPT_FAILONERROR => true,
        CURLOPT_FOLLOWLOCATION => true, // follow redirects
        CURLOPT_ENCODING => "", // accept all supported content encodings (gzip, deflate, ...)
        CURLOPT_USERAGENT => $ua, // who am I
        CURLOPT_AUTOREFERER => true, // set referer on redirect
        CURLOPT_CONNECTTIMEOUT => 10, // timeout on connect, in seconds
        CURLOPT_TIMEOUT => 10, // maximum time for the whole request, in seconds
        CURLOPT_MAXREDIRS => 5,
        CURLOPT_FORBID_REUSE => true,
    );

    $ch = curl_init($url);
    curl_setopt_array($ch, $options);
    $content = curl_exec($ch);

    // Use XPath or str_get_html($content) to parse

The first URL is decoded correctly and shows the characters as expected.

Example: $title_string = $html->find("title", 0)->plaintext returns the <title> text with correctly encoded characters.

The second URL shows garbled characters and square boxes: ¤ããªãããi��Ɨ� . If I apply utf8_decode($title_string), the second URL shows the expected characters.

The problem is that with utf8_decode($title_string), the first URL now shows square boxes.

Is there a universal way to solve this issue?

I have tried

    $charset = mb_detect_encoding($str);
    if ($charset == "UTF-8") {
        return utf8_decode($str);
    } else {
        return $str;
    }

It seems both strings come back from cURL as UTF-8. One displays correctly, the other shows square boxes.

I have also tried the suggestions in:

php curl response encoding

Strange behaviour when encoding cURL response as UTF-8

Replace unicode character

https://www.php.net/manual/en/function.mb-convert-encoding.php

Which charset should i use for multilingual website?

French and Chinese characters are not appearing correctly

And many more

I have spent hours trying to solve this. Any ideas are welcome.

  • xperia site contains explicit <head> <meta charset="utf-8"> … while fidelity does not? Commented Aug 27, 2021 at 20:15
  • a way to satisfactorily encode both to UTF-8?? T-you! Commented Aug 27, 2021 at 20:23
  • CURLOPT_ENCODING => 'UTF-8'? Commented Aug 27, 2021 at 20:49
  • I still get this see link ctrlv.link/CV8A with adding CURLOPT_ENCODING => 'UTF-8' Commented Aug 27, 2021 at 20:54
  • CURLOPT_ENCODING is about the content-encoding, so totally unrelated here Commented Aug 27, 2021 at 21:12

2 Answers

Both pages are UTF-8 encoded, and cURL returns the bytes as is. The problem is the processing that follows: assuming libxml2 is involved, it tries to guess the encoding from <meta> elements, and if there are none, it assumes ISO-8859-1. It can be forced to assume UTF-8 by prepending a UTF-8 BOM ("\xEF\xBB\xBF") to the HTML.
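As a minimal sketch of the BOM approach, the snippet below parses a small UTF-8 page that has no <meta charset> declaration. The sample HTML and the helper name loadUtf8Html() are assumptions made for illustration; they are not taken from either site or from any library. It assumes the input is already known to be UTF-8, as both pages in the question are.

```php
<?php
// Hypothetical sample page: UTF-8 bytes, but no <meta charset> declaration.
$html = '<html><head><title>日本語のタイトル</title></head><body></body></html>';

/**
 * Parse an HTML string that is known to be UTF-8.
 * loadUtf8Html() is an illustrative helper name, not a library function.
 */
function loadUtf8Html(string $html): DOMDocument
{
    $bom = "\xEF\xBB\xBF";

    // Prepend the BOM only if the string does not already start with one.
    if (strncmp($html, $bom, 3) !== 0) {
        $html = $bom . $html;
    }

    // Collect libxml parse warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $doc->loadHTML($html);

    // Discard collected warnings and restore the previous error mode.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $doc;
}

$doc = loadUtf8Html($html);
$title = $doc->getElementsByTagName('title')->item(0)->textContent;
echo $title, PHP_EOL;
```

Because the BOM is added only when it is missing, the same helper can be applied to pages that do declare their charset and to pages that do not, which avoids branching on mb_detect_encoding() or calling utf8_decode().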

1 Comment

This SAVED my skin. Thank you very much

As mentioned by @cmb in the answer above, here is my final code in full for anyone who wants the details:

    $url = "https://stackoverflow.com/";

    // Fetch the raw HTML as a string (the cURL response from the question works too)
    $html = file_get_contents($url);

    libxml_use_internal_errors(true); // collect libxml warnings instead of silencing them with @

    $doc = new DOMDocument();
    $doc->loadHTML("\xEF\xBB\xBF" . $html); // prepend the UTF-8 BOM so libxml2 assumes UTF-8
    $xpath = new DOMXPath($doc);
    $query = '//*/meta[starts-with(@property, \'og:\')]'; // all Open Graph <meta> tags
    $metas = $xpath->query($query);
    $rmetas = array();

    // Map each og: property to its content value
    foreach ($metas as $meta) {
        $property = $meta->getAttribute('property');
        $content = $meta->getAttribute('content');
        $rmetas[$property] = $content;
    }

    var_dump($rmetas);

Hope it helps someone in the same peril.
