How can I encode strings in UTF-16BE format in PHP? For "Demo Message!!!" the encoded string should be '00440065006D006F0020004D00650073007300610067006'. I also need to encode Arabic characters in this format.

  • Sorry, this ain't UTF-8, as you may have already noticed. It seems to be UTF-16BE. Commented May 1, 2010 at 10:34

2 Answers

First of all, this is definitely not UTF-8, which is just a character encoding (i.e. a way to store the characters of a string as bytes).

What you have here looks like a dump of the bytes that are used to build each character.

If so, you could get those bytes this way:

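// utf8_encode() converts ISO-8859-1 to UTF-8 (deprecated as of PHP 8.2);
// for this pure-ASCII string the bytes are unchanged.
$str = utf8_encode("Demo Message!!!");

// Print each byte of the string as two hex digits.
for ($i = 0; $i < strlen($str); $i++) {
    $byte = $str[$i];
    $char = ord($byte);
    printf('%02x ', $char);
}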

And you'd get the following output:

44 65 6d 6f 20 4d 65 73 73 61 67 65 21 21 21 

But, once again, this is not UTF-8: in UTF-8, as you can see in the example I've given, `D` is stored on only one byte: `0x44`.

In what you posted, it's stored using two bytes: 0x00 0x44.

Maybe you're using some kind of UTF-16?
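One way to test that guess is to dump the same character in a few encodings and compare the bytes with what was posted. The following is only a small sketch; it assumes the mbstring extension is available, and the variable names are just for illustration.

```php
<?php
// Sketch: the same character in three encodings (requires the mbstring extension).
$char = 'D'; // plain ASCII, so the encoding of this source file does not matter here

$utf8    = bin2hex($char);                                           // one byte
$utf16be = bin2hex(mb_convert_encoding($char, 'UTF-16BE', 'UTF-8')); // high byte first
$utf16le = bin2hex(mb_convert_encoding($char, 'UTF-16LE', 'UTF-8')); // low byte first

echo "UTF-8: $utf8, UTF-16BE: $utf16be, UTF-16LE: $utf16le\n";
// UTF-8: 44, UTF-16BE: 0044, UTF-16LE: 4400
```

Only the big-endian form puts 00 before 44, which is the pattern in the posted dump.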



EDIT after a bit more testing and @aSeptik's comment : this is indeed UTF-16.

To get the kind of dump you're getting, you'll have to make sure your string is encoded in UTF-16 (big-endian, to match your dump), which could be done this way, using, for example, the mb_convert_encoding function:

$str = mb_convert_encoding("Demo Message!!!", 'UTF-16BE', 'UTF-8');

Then, it's just a matter of iterating over the bytes that make up this string, and dumping their values, like I did before:

for ($i=0 ; $i<strlen($str) ; $i++) {
    $byte = $str[$i];
    $char = ord($byte);
    printf('%02x ', $char);
}

And you'll get the following output:

00 44 00 65 00 6d 00 6f 00 20 00 4d 00 65 00 73 00 73 00 61 00 67 00 65 00 21 00 21 00 21 

Which looks like what you posted :-)

(You just have to remove the space in the call to printf; I left it there to get easier-to-read output. Use %02X instead of %02x if you want uppercase digits, as in your dump.)
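If the goal is the compact string from the question rather than a byte-by-byte loop, the conversion and the hex dump can be combined. This is a minimal sketch, assuming the input is UTF-8 and the mbstring extension is loaded; `utf16be_hex` is a hypothetical helper name, not something from PHP itself.

```php
<?php
/**
 * Hypothetical helper: returns the UTF-16BE bytes of a UTF-8 string
 * as one uppercase hex string with no separators.
 */
function utf16be_hex(string $utf8): string
{
    // Re-encode the string, then dump every byte as two hex digits.
    $bytes = mb_convert_encoding($utf8, 'UTF-16BE', 'UTF-8');
    return strtoupper(bin2hex($bytes));
}

echo utf16be_hex('Demo Message!!!'), "\n";
// 00440065006D006F0020004D006500730073006100670065002100210021
```

bin2hex() does the same job as the printf loop without the spaces, and strtoupper() matches the uppercase digits used in the question.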

7 Comments

@aSeptik: Thanks :-) I've edited my answer to add some information about that :-)
He can also check with mb_detect_encoding('00440065006D006F0020004D00650073007300610067006'); -> ASCII
@Pascal Martin: I got the dump by using printf("%04x", $char); instead of printf("%02x ", $char); in your first answer. Now I'm confused. What's the difference?
With %04x, each byte is padded to 4 hex digits; with %02x, to 2 hex digits. After that, it's a matter of encoding: with UTF-8, which is what my first portion of code uses, some characters are stored on one byte, others on two, some on 3, and, if I remember correctly, some on 4 bytes.
All the characters used in your example string are "simple" characters, stored on 1 byte when using UTF-8, which explains why the first portion of code doesn't output any 00. But with more complex characters, you'll see that you need to iterate byte by byte, and use %02x, to display the value of each byte.

E.g. by using the mbstring extension and its mb_convert_encoding() function.

$in = 'Demo Message!!!';
$out = mb_convert_encoding($in, 'UTF-16BE');

for($i=0; $i<strlen($out); $i++) {
  printf("%02X ", ord($out[$i]));
}

prints

00 44 00 65 00 6D 00 6F 00 20 00 4D 00 65 00 73 00 73 00 61 00 67 00 65 00 21 00 21 00 21 

Or by using iconv()

$in = 'Demo Message!!!';
$out = iconv('iso-8859-1', 'UTF-16BE', $in);

for($i=0; $i<strlen($out); $i++) {
  printf("%02X ", ord($out[$i]));
}
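
For the Arabic part of the question, the source encoding passed to iconv() has to be one that can actually represent Arabic; 'iso-8859-1' cannot. The sketch below assumes the input is UTF-8 and that the iconv extension is available; the sample word and variable names are only illustrative. It also converts the hex string back, as a quick check that nothing was lost.

```php
<?php
// "\u{...}" escapes (PHP 7.0+) yield UTF-8 bytes no matter how this file is saved.
$in = "\u{0645}\u{0631}\u{062D}\u{0628}\u{0627}"; // an Arabic word, five letters

// Convert from UTF-8 to UTF-16BE, then dump the bytes as uppercase hex.
$hex = strtoupper(bin2hex(iconv('UTF-8', 'UTF-16BE', $in)));
echo $hex, "\n"; // 06450631062D06280627

// Round trip: hex -> raw UTF-16BE bytes -> UTF-8.
$back = iconv('UTF-16BE', 'UTF-8', hex2bin($hex));
var_dump($back === $in); // bool(true)
```

Each of these letters takes two bytes in UTF-16BE, so the hex string is simply the code points written one after another.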
