I'm trying to get the UTF-8 bytes (as decimal values) of a Unicode string. For instance:

function unicode_to_utf8_bytes($string) {

}

$text = 'Hello 😀';
$result = unicode_to_utf8_bytes($text);

var_dump($result);

array(10) {
  [0]=>
  int(72)
  [1]=>
  int(101)
  [2]=>
  int(108)
  [3]=>
  int(108)
  [4]=>
  int(111)
  [5]=>
  int(32)
  [6]=>
  int(240)
  [7]=>
  int(159)
  [8]=>
  int(152)
  [9]=>
  int(128)
}

An example of the result can be seen here:

http://apps.timwhitlock.info/unicode/inspect?s=Hello+%F0%9F%98%80

I feel I'm close. This is what I have so far:

function utf8_char_code_at($str, $index) {

    $char = mb_substr($str, $index, 1, 'UTF-8');

    if (mb_check_encoding($char, 'UTF-8')) {
        $ret = mb_convert_encoding($char, 'UTF-32BE', 'UTF-8');
        return hexdec(bin2hex($ret));
    }
    else
        return null;

}

function unicode_to_utf8_bytes($str) { 

    $result = array();

    for ($i=0; $i<mb_strlen($str, '8bit'); $i++)
        $result[] = utf8_char_code_at($str, $i);

    return $result;

}

$string = 'Hello 😀';

var_dump(unicode_to_utf8_bytes($string));

array(10) {
  [0]=>
  int(72)
  [1]=>
  int(101)
  [2]=>
  int(108)
  [3]=>
  int(108)
  [4]=>
  int(111)
  [5]=>
  int(32)
  [6]=>
  int(128512)
  [7]=>
  int(0)
  [8]=>
  int(0)
  [9]=>
  int(0)
}

Any help will be much appreciated!

  • Sorry, but it is unclear what you are actually trying to do. UTF-8 is one possible representation of Unicode characters; others exist. A "conversion from Unicode to UTF-8" therefore does not really make sense. What do you actually mean by "unicode"? What do you mean by "UTF-8 bytes"? Commented Jan 3, 2016 at 18:59
  • This may be of help. Just call the function in that answer on all characters in your string and it should work. Commented Jan 3, 2016 at 19:03

1 Answer

This gets the results you were looking for:

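$string = 'Hello 😀';
var_export(ascii_to_dec($string));

// Returns the decimal value of every byte in $str.
function ascii_to_dec($str)
{
  $dec_array = array(); // initialised so an empty string yields an empty array
  for ($i = 0, $j = strlen($str); $i < $j; $i++) {
    // $str[$i] is the byte at offset $i; the older $str{$i} form was removed in PHP 8.0.
    $dec_array[] = ord($str[$i]);
  }
  return $dec_array;
}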

Results:

array (
  0 => 72,
  1 => 101,
  2 => 108,
  3 => 108,
  4 => 111,
  5 => 32,
  6 => 240,
  7 => 159,
  8 => 152,
  9 => 128,
)
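
As a possible alternative to the loop, the same byte values can probably be obtained with PHP's built-in unpack(). This is only a sketch: the helper name utf8_bytes is made up for illustration, and it assumes the input string is already UTF-8 encoded. The emoji is written as the escape "\u{1F600}" (PHP 7.0+) so the result does not depend on how the source file is saved.

```php
<?php
// Hypothetical helper, not part of the original answer.
function utf8_bytes(string $str): array
{
    // Handle the empty string explicitly so the return type stays an array.
    if ($str === '') {
        return [];
    }
    // unpack('C*') yields a 1-indexed array of unsigned byte values;
    // array_values() re-indexes it from 0.
    return array_values(unpack('C*', $str));
}

// "\u{1F600}" is the escape sequence for the 😀 character.
var_export(utf8_bytes("Hello \u{1F600}"));
```

Like the loop above, this works on bytes, not characters, so it never needs the mb_* functions.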

Source


1 Comment

You should add a bit of explanation to this one. Assuming your source file is encoded as UTF-8, the string literal already contains UTF-8 encoded bytes, so no conversion is needed. In this context the function would more accurately be named bytes_to_dec.
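
To illustrate the point in this comment, here is a small sketch that groups the byte values by character. The helper name utf8_bytes_per_char is hypothetical, and the code assumes the input is valid UTF-8 (preg_split() with the /u modifier returns false otherwise, in which case the sketch falls back to an empty array).

```php
<?php
// Hypothetical helper: returns one array of byte values per character.
function utf8_bytes_per_char(string $str): array
{
    $groups = [];
    // The /u modifier makes PCRE treat the subject as UTF-8, so the empty
    // pattern splits between characters rather than between bytes.
    $chars = preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY) ?: [];
    foreach ($chars as $char) {
        // str_split() cuts the character into single bytes; ord() gives each value.
        $groups[] = array_map('ord', str_split($char));
    }
    return $groups;
}

$text = "Hi \u{1F600}";     // escape for 😀, independent of file encoding
echo strlen($text), "\n";   // strlen() counts bytes, not characters
print_r(utf8_bytes_per_char($text));
```

The ASCII characters each map to a single byte, while the emoji maps to four, which is why a byte-wise loop with ord() is enough to produce the list asked for in the question.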
