I am struggling to understand character encoding in PHP.

Consider the following script:

$string = "\xe2\x82\xac";

var_dump(mb_internal_encoding());
var_dump($string);
var_dump(unpack('C*', $string));
$utf8string = mb_convert_encoding($string, "UTF-8");
var_dump($utf8string);
var_dump(unpack('C*', $utf8string));

mb_internal_encoding("UTF-8");

var_dump($string);
var_dump($utf8string);

I have a string, actually the € character, represented with its Unicode code points. Up to PHP 5.5 the default internal encoding is ISO-8859-1, hence I think that my string will be encoded using this encoding. With unpack I can see the byte representation of my string, and it corresponds to the hexadecimal codes I used to define the string.

Then I convert the encoding of the string to UTF-8, using mb_convert_encoding. At this point the string displays differently on the screen and its byte representation changes (and this is expected).

If I also change the PHP internal encoding to UTF-8, I'd expect $utf8string to be displayed correctly on the screen, but this doesn't happen.

What am I missing?

  • Are you running this in a browser? That will make its own choice about character encoding and you'd be better off entity escaping the offending character. Commented Apr 19, 2016 at 20:24
  • Add a <meta charset="UTF-8"> to the <head> to make sure the browser is also expecting UTF8 Commented Apr 19, 2016 at 20:27
  • All strings in PHP are treated as binary strings, and mb_internal_encoding() affects literally nothing other than how other mb_* functions work. You also can't output both ISO-8859 and UTF8 within the same document and expect anything sane to happen regardless of what program is generating the output. Commented Apr 19, 2016 at 20:41

2 Answers

2

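The script you show doesn't use any non-ASCII characters (the bytes are written as \x escape sequences), so the encoding the script file is saved in does not make any difference. mb_internal_encoding() does not re-encode your strings; it only sets the encoding that the mb_* functions assume by default. Converting data on output is a separate mechanism (mb_output_handler together with mb_http_output), which uses the internal encoding as its source. This question will tell you more about how it works; it will also tell you it's better not to use it.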
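As a rough sketch of what the internal encoding setting touches (variable names here are illustrative, not from the question), the following compares the same three bytes under two internal encodings. Only the result of an mb_* function that falls back to the internal encoding changes; the bytes themselves stay the same.

```php
<?php
// The same three bytes as in the question.
$string = "\xe2\x82\xac";

// With ISO-8859-1 as the internal encoding, mb_strlen() counts
// every byte as one character.
mb_internal_encoding('ISO-8859-1');
$lenLatin1 = mb_strlen($string);
$hexLatin1 = bin2hex($string);

// With UTF-8 as the internal encoding, the three bytes form one character.
mb_internal_encoding('UTF-8');
$lenUtf8 = mb_strlen($string);
$hexUtf8 = bin2hex($string);

var_dump($lenLatin1);              // int(3)
var_dump($lenUtf8);                // int(1)
var_dump($hexLatin1 === $hexUtf8); // bool(true): the string was never modified
```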

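The three-byte string $string in your code is the UTF-8 representation of the Euro symbol, not its "Unicode code point". The code point is U+20AC, a number that fits in two bytes (0x20 0xAC), as do all code points in the Basic Multilingual Plane, which covers most characters in common use.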
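To make the difference between the UTF-8 bytes and the code point concrete, here is a small sketch. It assumes the mbstring extension is loaded and passes the source encoding explicitly, so it does not depend on the internal encoding. Converting to UTF-16BE is used only as a convenient way to expose the code point value for a character in the Basic Multilingual Plane.

```php
<?php
$string = "\xe2\x82\xac";

// Raw bytes of the UTF-8 representation.
$utf8Hex = bin2hex($string);

// For a BMP character, its UTF-16BE encoding is the code point as two bytes.
$utf16Hex = bin2hex(mb_convert_encoding($string, 'UTF-16BE', 'UTF-8'));

var_dump($utf8Hex);                    // string(6) "e282ac"
var_dump($utf16Hex);                   // string(4) "20ac"
var_dump(strlen($string));             // int(3): bytes
var_dump(mb_strlen($string, 'UTF-8')); // int(1): characters
```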

Does this clear up the behavior you see?

1

You started with a string that is the UTF-8 representation of the Euro symbol. If you run echo($string), all versions of PHP output the three bytes you put in $string. How they are interpreted by the browser depends on the character set specified in the Content-Type header. If it is text/html; charset=utf-8, then you get the Euro sign in the rendered page.
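A minimal sketch of declaring the character set when sending the page might look like this. The buildPage() helper is hypothetical and exists only for this illustration; the type declarations assume PHP 7 or newer.

```php
<?php
// Hypothetical helper (not part of PHP): wraps the given bytes in a
// minimal HTML page that also declares UTF-8 in a meta tag.
function buildPage(string $body): string
{
    return '<!DOCTYPE html><html><head><meta charset="UTF-8"></head>'
        . '<body>' . $body . '</body></html>';
}

$string = "\xe2\x82\xac";

// Tell the browser how to interpret the bytes that follow.
// The guard avoids a warning if output has already started.
if (!headers_sent()) {
    header('Content-Type: text/html; charset=utf-8');
}

// PHP sends the three bytes untouched; the browser decodes them as UTF-8.
echo buildPage($string);
```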

Then you make a wrong move: you call mb_convert_encoding() with only two arguments. PHP then uses the current internal encoding of the mbstring extension as the third argument ($from_encoding). Why is that a problem?

For PHP 5.6 and newer, the default value returned by mb_internal_encoding() is UTF-8 and the call to mb_convert_encoding() is a no-op.

But for previous versions of PHP, the default value returned by mb_internal_encoding() is ISO-8859-1, which doesn't match the encoding of your string. Accordingly, mb_convert_encoding() interprets the bytes of $string as three individual characters and encodes each of them using the rules of UTF-8, producing six bytes. The outcome is obviously wrong.
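The effect can be reproduced on any PHP version by passing the source encoding explicitly instead of relying on the default. The variable names below are illustrative.

```php
<?php
$string = "\xe2\x82\xac";

// Treating the bytes as ISO-8859-1 (the pre-5.6 default) turns each byte
// into its own character, and each one is then encoded as two UTF-8 bytes.
$wrong = mb_convert_encoding($string, 'UTF-8', 'ISO-8859-1');

// Declaring the real source encoding leaves the bytes unchanged.
$right = mb_convert_encoding($string, 'UTF-8', 'UTF-8');

var_dump(bin2hex($wrong)); // string(12) "c3a2c282c2ac"
var_dump(bin2hex($right)); // string(6) "e282ac"
```

Passing $from_encoding explicitly keeps the result independent of the PHP version and of any mb_internal_encoding() call made elsewhere.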

Btw, if you initialize $string with '€' you get the same output on all PHP versions (even on PHP 4, iirc).
