I am struggling to understand character encoding in PHP.
Consider the following script (you can run it here):
$string = "\xe2\x82\xac";
var_dump(mb_internal_encoding());
var_dump($string);
var_dump(unpack('C*', $string));
$utf8string = mb_convert_encoding($string, "UTF-8");
var_dump($utf8string);
var_dump(unpack('C*', $utf8string));
mb_internal_encoding("UTF-8");
var_dump($string);
var_dump($utf8string);
I have a string, actually the € character, represented with its Unicode code points. Up to PHP 5.5 the default internal encoding is ISO-8859-1, so I assume that my string is encoded using this encoding. With unpack I can see the byte representation of my string, and it corresponds to the hexadecimal codes I used to define the string.
Then I convert the encoding of the string to UTF-8, using mb_convert_encoding. At this point the string displays differently on the screen and its byte representation changes (and this is expected).
If I also change the PHP internal encoding to UTF-8, I'd expect $utf8string to be displayed correctly on the screen, but this doesn't happen.
What am I missing?
Add <meta charset="UTF-8"> to the <head> to make sure the browser is also expecting UTF-8.
mb_internal_encoding() affects literally nothing other than how other mb_* functions work. You also can't output both ISO-8859 and UTF-8 within the same document and expect anything sane to happen, regardless of what program is generating the output.
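A minimal sketch of what is probably going on, assuming the mbstring extension is loaded. The bytes E2 82 AC are already the UTF-8 encoding of €, so converting them "to UTF-8" while PHP treats them as ISO-8859-1 encodes each byte a second time. The source encoding is passed explicitly here (an assumption made so that the result does not depend on the PHP version's default), and $double is a name introduced only for this illustration.

```php
<?php
// E2 82 AC is already the UTF-8 byte sequence for U+20AC (the euro sign).
$string = "\xe2\x82\xac";

// Interpreting those bytes as ISO-8859-1 and converting to UTF-8 re-encodes
// each byte on its own, which mirrors what the script in the question does
// when the internal encoding is ISO-8859-1.
$double = mb_convert_encoding($string, 'UTF-8', 'ISO-8859-1');

echo bin2hex($string), "\n"; // e282ac
echo bin2hex($double), "\n"; // c3a2c282c2ac

// mb_internal_encoding() only changes how mb_* functions interpret bytes.
// The bytes stored in the string, and strlen(), are unaffected.
mb_internal_encoding('ISO-8859-1');
var_dump(mb_strlen($string)); // int(3)
mb_internal_encoding('UTF-8');
var_dump(mb_strlen($string)); // int(1)
var_dump(strlen($string));    // int(3)
```

Under these assumptions the original $string is the one that should render as € once the page is served and read as UTF-8, while the converted string is double-encoded and will show as mojibake regardless of the internal encoding setting.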