1

I am using PHP's DOMDocument class with HTML 5 document. But when I do, some utf-8 characters are "changed". I got  , ’, é etc....

Here is my code.

    $parsedUrl = 'http://www.futursparents.com/';

    $curl = curl_init();
    @curl_setopt_array($curl, [
            CURLOPT_RETURNTRANSFER => 1,
            CURLOPT_TIMEOUT => 60,
            CURLOPT_CONNECTTIMEOUT => 30,
            CURLOPT_FOLLOWLOCATION => TRUE,
            CURLOPT_MAXREDIRS => 5,
            CURLOPT_AUTOREFERER => FALSE,
            CURLOPT_HEADER => TRUE, // FALSE
            CURLOPT_PROTOCOLS => CURLPROTO_HTTP | CURLPROTO_HTTPS,
            CURLOPT_REDIR_PROTOCOLS => CURLPROTO_HTTP | CURLPROTO_HTTPS,
            CURLOPT_CERTINFO => TRUE,
            CURLOPT_LOW_SPEED_LIMIT => 200,
            CURLOPT_LOW_SPEED_TIME => 50,
            CURLOPT_HTTP_VERSION => CURL_HTTP_VERSION_1_1,
            CURLOPT_PROXYTYPE => CURLPROXY_HTTP,
            CURLOPT_ENCODING => 'gzip,deflate',
            CURLOPT_URL => $parsedUrl,
        ]);
    $response = curl_exec($curl);
    $info = curl_getinfo($curl);
    $error = curl_error($curl);
    $headers = trim(substr($response, 0, curl_getinfo($curl, CURLINFO_HEADER_SIZE)));
    $content = substr($response, curl_getinfo($curl, CURLINFO_HEADER_SIZE));

    curl_close($curl);

    libxml_use_internal_errors(true);

    $domDoc = new DOMDocument();
    print_r($domDoc->encoding); // It's OK => UTF-8
    // Got   or s’ or &eacute etc....
    print_r($domDoc->saveHTML());

It seem to be an HTML5 doctype with a meta element like so <meta charset=utf-8">

If I add the charset meta tag <meta http-equiv="Content-Type" content="text/html; charset=utf-8">, It's seem to be OK.

$domDoc->loadHTML('<meta http-equiv="Content-Type" content="text/html; charset=utf-8">' . $content);
// No &ensp; or s&rsquo; or &eacute etc....
print_r($domDoc->saveHTML());

Do you think this is the right solution?

0

1 Answer 1

4

I found why.

The DOM extension was built on libxml2 whose HTML parser was made for HTML 4. If an HTML5 doctype and a meta element like so <meta charset="utf-8"> HTML code will get interpreted as ISO-8859-something and non-ASCII chars will get converted into HTML entities.

However the HTML4-like version will work <meta http-equiv="Content-Type" content="text/html; charset=utf-8">

Reference: UTF-8 with PHP DOMDocument loadHTML?

Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.