Understanding character encoding in PHP

Question

I am struggling to understand character encoding in PHP.

Consider the following script (you can run it here):

$string = "\xe2\x82\xac";

var_dump(mb_internal_encoding());
var_dump($string);
var_dump(unpack('C*', $string));
$utf8string = mb_convert_encoding($string, "UTF-8");
var_dump($utf8string);
var_dump(unpack('C*', $utf8string));

mb_internal_encoding("UTF-8");

var_dump($string);
var_dump($utf8string);

I have a string, actually the € character, represented with its unicode code points. Up to PHP 5.5 the default internal encoding is ISO-8859-1, hence I think that my string will be encoded using this encoding. With unpack I can see the byte representation of my string, and it corresponds to the hexadecimal codes I used to define the string.

Then I convert the encoding of the string to UTF-8, using mb_convert_encoding. At this point the string displays differently on the screen and its byte representation changes (and this is expected).

If I also change the PHP internal encoding to UTF-8, I'd expect $utf8string to be displayed correctly on the screen, but this doesn't happen.

What am I missing?

Are you running this in a browser? That will make its own choice about character encoding and you'd be better off entity escaping the offending character.
– Chris, Commented Apr 19, 2016 at 20:24
Add a <meta charset="UTF-8"> to the <head> to make sure the browser is also expecting UTF8.
– RiggsFolly, Commented Apr 19, 2016 at 20:27
All strings in PHP are treated as binary strings, and mb_internal_encoding() affects literally nothing other than how other mb_* functions work. You also can't output both ISO-8859 and UTF8 within the same document and expect anything sane to happen regardless of what program is generating the output.
– Sammitch, Commented Apr 19, 2016 at 20:41

Community · Accepted Answer · 2017-05-23 12:16:12Z

2

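The script you show doesn't use any non-ASCII characters, so its internal encoding does not make any difference. mb_internal_encoding does not convert strings you have already built; it sets the default encoding that the mb_* functions assume, and it only plays a part on output if you install mb_output_handler, which converts from the internal encoding. This question will tell you more about how it works; it will also tell you it's better not to use it.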

The three-byte string $string in your code is the UTF-8 representation of the Euro symbol, not its "unicode code point" (which is U+20AC, a number that fits in two bytes, like the code points of all common Unicode characters).
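To make the difference between bytes and code point concrete, here is a minimal sketch. It assumes the mbstring extension is loaded and PHP 7.2 or newer (for mb_ord()); the variable names are only illustrative.

```php
<?php
// The three bytes from the question: the UTF-8 encoding of the Euro sign.
$euro = "\xe2\x82\xac";

$bytes     = bin2hex($euro);                            // raw bytes as hex
$byteCount = strlen($euro);                             // strlen() counts bytes
$charCount = mb_strlen($euro, 'UTF-8');                 // mb_strlen() counts characters
$codePoint = sprintf('U+%04X', mb_ord($euro, 'UTF-8')); // mb_ord() needs PHP 7.2+

echo "bytes: $bytes ($byteCount bytes, $charCount character)\n"; // bytes: e282ac (3 bytes, 1 character)
echo "code point: $codePoint\n";                                 // code point: U+20AC
```

The encoding is passed explicitly to every mb_* call here so the result does not depend on the internal encoding of the PHP version in use.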

Does this clear up the behavior you see?

– alexis, answered Apr 19, 2016 at 20:27 (edited May 23, 2017 at 12:16 by Community)


axiac · Accepted Answer · 2016-04-19 20:49:10Z

You started with a string that is the utf-8 representation of the Euro symbol. If you run echo($string) all versions of PHP produce the three bytes you put in $string. How they are interpreted by the browser depends on the character set specified in the Content-Type header. If it is text/html; charset=utf-8 then you get the Euro sign in the rendered page.
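A small sketch of that point, assuming the mbstring extension is loaded: changing the internal encoding leaves the bytes of $string untouched and only changes how mb_* functions count them when no encoding is passed. The variable names are illustrative.

```php
<?php
$string = "\xe2\x82\xac";

// With ISO-8859-1 as the internal encoding, every byte counts as one character.
mb_internal_encoding('ISO-8859-1');
$hexLatin1 = bin2hex($string);
$lenLatin1 = mb_strlen($string);

// With UTF-8, the same three bytes are read as a single character.
mb_internal_encoding('UTF-8');
$hexUtf8 = bin2hex($string);
$lenUtf8 = mb_strlen($string);

echo "$hexLatin1 / $lenLatin1\n"; // e282ac / 3
echo "$hexUtf8 / $lenUtf8\n";     // e282ac / 1
```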

Then you make a wrong move. You call mb_convert_encoding() with only two arguments, so PHP uses the current internal encoding of the mbstring extension as the third argument ($from_encoding). Why does that matter?

For PHP 5.6 and newer, the default value returned by mb_internal_encoding() is utf-8 and the call to mb_convert_encoding() is a no-op.

But for previous versions of PHP, the default value returned by mb_internal_encoding() is iso-8859-1 and it doesn't match the encoding of your string. Accordingly, mb_convert_encoding() interprets the bytes of $string as three individual characters and encodes them using the rules of utf-8. The outcome is obviously wrong.
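The two cases can be reproduced on any PHP version by passing the source encoding explicitly instead of relying on the default. This is a sketch assuming the mbstring extension is loaded; the variable names are illustrative.

```php
<?php
$string = "\xe2\x82\xac";

// What older PHP versions did implicitly: read each byte as an ISO-8859-1
// character (U+00E2, U+0082, U+00AC) and encode each one as UTF-8.
$misread = mb_convert_encoding($string, 'UTF-8', 'ISO-8859-1');

// What PHP 5.6+ does by default: source and target are both UTF-8,
// so the bytes come back unchanged.
$unchanged = mb_convert_encoding($string, 'UTF-8', 'UTF-8');

echo bin2hex($misread), "\n";   // c3a2c282c2ac
echo bin2hex($unchanged), "\n"; // e282ac
```

Always passing the third argument avoids depending on whatever mb_internal_encoding() happens to return.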

Btw, if you initialize $string with '€' you get the same output on all PHP versions (even on PHP 4, iirc).
