How do I find the number of bytes within UTF-8 string with PHP?

Question

I have the following function from the php.net site to determine the # of bytes in an ASCII or UTF-8 string:

<?php 
/** 
 * Count the number of bytes of a given string. 
 * Input string is expected to be ASCII or UTF-8 encoded. 
 * Warning: the function doesn't return the number of chars 
 * in the string, but the number of bytes. 
 * 
 * @param string $str The string to compute number of bytes 
 * 
 * @return The length in bytes of the given string. 
 */ 
function strBytes($str) 
{ 
  // STRINGS ARE EXPECTED TO BE IN ASCII OR UTF-8 FORMAT 

  // Number of characters in string 
  $strlen_var = strlen($str); 

  // string bytes counter 
  $d = 0; 

 /* 
  * Iterate over every character in the string, 
  * escaping with a slash or encoding to UTF-8 where necessary 
  */ 
  for ($c = 0; $c < $strlen_var; ++$c) { 

      $ord_var_c = ord($str{$d}); 

      switch (true) { 
          case (($ord_var_c >= 0x20) && ($ord_var_c <= 0x7F)): 
              // characters U-00000000 - U-0000007F (same as ASCII) 
              $d++; 
              break; 

          case (($ord_var_c & 0xE0) == 0xC0): 
              // characters U-00000080 - U-000007FF, mask 110XXXXX 
              // see http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8 
              $d+=2; 
              break; 

          case (($ord_var_c & 0xF0) == 0xE0): 
              // characters U-00000800 - U-0000FFFF, mask 1110XXXX 
              // see http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8 
              $d+=3; 
              break; 

          case (($ord_var_c & 0xF8) == 0xF0): 
              // characters U-00010000 - U-001FFFFF, mask 11110XXX 
              // see http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8 
              $d+=4; 
              break; 

          case (($ord_var_c & 0xFC) == 0xF8): 
              // characters U-00200000 - U-03FFFFFF, mask 111110XX 
              // see http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8 
              $d+=5; 
              break; 

          case (($ord_var_c & 0xFE) == 0xFC): 
              // characters U-04000000 - U-7FFFFFFF, mask 1111110X 
              // see http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8 
              $d+=6; 
              break; 
          default: 
            $d++;    
      } 
  } 

  return $d; 
} 
?>

However, when I try this with Russian text (e.g. По своей природе компьютеры могут работать лишь с числами. И для того, чтобы они могли хранить в памяти буквы или другие символы, каждому такому символу должно быть поставлено в соответствие число.), it doesn't seem to return the correct number of bytes.

The switch statement keeps ending up in the default case. Any ideas why Russian characters are not handled as expected? Or are there better options for this?

I am asking this as I need to shorten a UTF-8 string to a certain number of bytes. In my situation I can only send a maximum of 169 bytes of JSON data to the iPhone APNS (excluding the other packet data).

Reference: PHP strlen - Manual (Paolo Comment on 10-Jan-2007 03:58)

The function is from the comment in the reference at the bottom of the post. I didn't code it :) However it looks like it is along the right process rather than using mb_strlen, apart from the Russian characters not working. — Luke
– Luke, Commented Mar 5, 2010 at 4:31
It is? Imho if you have multiple elseifs you should use a switch() when possible as in OP. Maybe it is just me. :) — PeeHaa
– PeeHaa, Commented Oct 20, 2012 at 16:14

Michael Borgwardt · Accepted Answer · 2010-03-05 11:51:06Z

5

I am asking this as I need to shorten a utf-8 string to a certain number of bytes.

mb_strcut() does exactly this, though you might not be able to tell from the barely comprehensible documentation.
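As a minimal sketch of that suggestion, assuming PHP 7+ with the mbstring extension loaded: mb_strcut() takes its start and length in bytes and moves the cut back to a character boundary. The helper name truncateToBytes is made up for illustration.

```php
<?php
// Hypothetical helper; requires the mbstring extension.
function truncateToBytes(string $str, int $maxBytes): string
{
    // mb_strcut() measures start and length in bytes, and backs the cut
    // off to the nearest character boundary of the given encoding.
    return mb_strcut($str, 0, $maxBytes, 'UTF-8');
}

// 'Привет' is 6 characters and 12 bytes in UTF-8. A 5-byte budget cannot
// hold a third 2-byte character, so only the first two are kept.
$short = truncateToBytes('Привет', 5);
echo $short, ' (', strlen($short), " bytes)\n";
```

With mb_substr() the third argument would be a character count instead, which is why it does not map directly onto a byte budget.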

answered Mar 5, 2010 at 11:51

Michael Borgwardt

1 Comment

Luke Over a year ago

Thank you, using mb_strcut() is better than mb_substr() for my situation.

goat · Accepted Answer · 2010-03-05 07:35:48Z

2

strlen() returns the number of bytes.
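For what it's worth, the function in the question appears to overcount because its for loop runs strlen($str) times while the index $d jumps ahead by 2 to 6 bytes per iteration. $d therefore runs past the end of the string, and every remaining iteration reads an empty offset and lands in the default case. Below is a sketch of the same lead-byte walk that stops at the end of the string. The name utf8WalkBytes is hypothetical, PHP 7+ and well-formed UTF-8 are assumed, and the result should simply agree with strlen().

```php
<?php
// Hypothetical rewrite of the walk from the question. Assumes well-formed UTF-8.
function utf8WalkBytes(string $str): int
{
    $len = strlen($str); // byte length, not character count
    $i = 0;              // byte offset of the current lead byte

    while ($i < $len) {  // stop at the end instead of looping $len times
        $byte = ord($str[$i]);
        if (($byte & 0xE0) === 0xC0) {
            $i += 2;     // 110xxxxx: two-byte sequence
        } elseif (($byte & 0xF0) === 0xE0) {
            $i += 3;     // 1110xxxx: three-byte sequence
        } elseif (($byte & 0xF8) === 0xF0) {
            $i += 4;     // 11110xxx: four-byte sequence
        } else {
            $i += 1;     // ASCII, or a byte that is not a valid lead byte
        }
    }

    return $i;
}

var_dump(utf8WalkBytes('Привет') === strlen('Привет'));
```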

Shortening a multibyte string to a certain number of bytes is a separate task. You will need to take care not to chop the string off in the middle of a multibyte sequence as you shorten it.
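One possible way to avoid cutting inside a sequence without mbstring is to look at the first byte that would be dropped and step back while it is a continuation byte (bit pattern 10xxxxxx). This is a sketch only: cutAtCharBoundary is a made-up name, and PHP 7+ and well-formed UTF-8 input are assumed.

```php
<?php
// Hypothetical helper; assumes well-formed UTF-8 input.
function cutAtCharBoundary(string $str, int $maxBytes): string
{
    if ($maxBytes <= 0) {
        return '';
    }
    if (strlen($str) <= $maxBytes) {
        return $str; // already fits
    }

    // $str[$end] is the first byte that would be cut off. If it is a
    // continuation byte, the cut is mid-character, so move it back
    // until it sits on a lead byte.
    $end = $maxBytes;
    while ($end > 0 && (ord($str[$end]) & 0xC0) === 0x80) {
        $end--;
    }

    return substr($str, 0, $end);
}

echo cutAtCharBoundary('Привет', 5), "\n";
```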

The other thing you need to handle is that when you put a string into JSON notation, it might need more bytes to represent it. For example, if your string contains a double quote character, it needs to be escaped, and the backslash adds one byte. There are other characters that need to be escaped too. The point is, it can get larger. I assume the byte limit is on the total JSON payload, so you need to account for the JSON syntax itself, as well as any escaping that JSON imposes on your string.

An unoptimized, kinda hacky way to do it is to chop the string, at say 5 bytes more than your limit, using substr(). Now use mb_strlen() to get number of characters, and mb_substr() to remove the last character. Now encode it as json, and measure the bytes via strlen(). Enter a loop, which keeps chopping off the last character using mb_substr(), encodes as json, and again measure bytes using strlen(). The loop terminates when the number of bytes is acceptable.
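The loop described above might look roughly like the sketch below. The payload shape ['aps' => ['alert' => ...]] and the name fitAlertToJsonLimit are assumptions for illustration, and PHP 7+ with mbstring is assumed. Note that json_encode() by default writes non-ASCII characters as \uXXXX escapes (6 bytes for each character in the Basic Multilingual Plane), so the sketch passes JSON_UNESCAPED_UNICODE (PHP 5.4+) to keep the raw UTF-8 bytes.

```php
<?php
// Hypothetical helper; the payload layout is an assumption.
function fitAlertToJsonLimit(string $text, int $maxBytes): string
{
    $flags = JSON_UNESCAPED_UNICODE; // keep UTF-8 as-is instead of \uXXXX

    while ($text !== '') {
        $json = json_encode(['aps' => ['alert' => $text]], $flags);
        // strlen() measures the encoded payload in bytes, escaping included.
        if ($json !== false && strlen($json) <= $maxBytes) {
            return $json;
        }
        // Too long (or not encodable): drop the last character and retry.
        $text = mb_substr($text, 0, -1, 'UTF-8');
    }

    // Nothing fits; fall back to an empty alert.
    return json_encode(['aps' => ['alert' => '']], $flags);
}

$json = fitAlertToJsonLimit(str_repeat('я', 200), 169);
echo strlen($json), " bytes\n";
```

Dropping one character per pass is simple but slow for long inputs. Cutting the text close to the limit first, as suggested above, would reduce the number of passes.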

answered Mar 5, 2010 at 7:35

goat

2 Comments

Luke Over a year ago

I already have a while loop that keeps chopping one character at a time using mb_substr until the byte count falls below the limit. strlen doesn't seem to return the same # of bytes as the function in my question. strlen() may or may not be overloaded by mb_strlen(), as per other comments, so it shouldn't be relied on.

goat Over a year ago

So don't overload strlen. If you don't control it, then there's other ways. Eg while (isset($str[$i])) $i++; will do the trick. Or fwrite() it to a stream or something...

Phil Rykoff · Accepted Answer · 2010-03-05 11:45:56Z

2

If you wish to find the byte length of a multi-byte string when you are using mbstring.func_overload 2 and UTF-8 strings, then you can use the following:

mb_strlen($utf8_string, 'latin1');
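A small sketch of this idea, assuming mbstring is available: passing a single-byte encoding makes mb_strlen() treat every byte as one character. mbstring also accepts the encoding name '8bit', which states the intent more directly than 'latin1'. The helper name byteLength is hypothetical. As far as I know, mbstring.func_overload was deprecated in PHP 7.2 and removed in PHP 8.0, so on current versions plain strlen() cannot be overloaded anyway.

```php
<?php
// Hypothetical helper; requires the mbstring extension.
function byteLength(string $str): int
{
    // '8bit' treats each byte as one character, so the result is the
    // byte count regardless of how strlen() is configured.
    return mb_strlen($str, '8bit');
}

$s = 'Привет';
echo mb_strlen($s, 'UTF-8'), " characters\n";
echo byteLength($s), " bytes\n";
```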

edited Mar 5, 2010 at 11:45

answered Mar 5, 2010 at 3:04

Phil Rykoff

2 Comments

Luke Over a year ago

Doesn't this just give the string length in the # of characters? I need to know the actual number of bytes that is being used. Within utf-8 a character can be more than one byte, correct?

Phil Rykoff Over a year ago

According to the comments section of php.net/manual/en/function.mb-strlen.php (very bottom), it's widely agreed that this function, called in the way described, will count the BYTES. When you tell the function that your input string contains latin1 (i.e. single-byte) characters, it should count every byte as one character, even if the byte is not a valid character in that encoding. Could you try this out? Unfortunately I don't have an mb-enabled environment...

Community · Accepted Answer · 2023-11-17 19:42:58Z

1

In PHP 5, mb_strlen should return the number of characters, and strlen should return the number of bytes.

For instance, this portion of code:

$string = 'По своей природе компьютеры могут работать лишь с числами. И для того, чтобы они могли хранить в памяти буквы или другие символы, каждому такому символу должно быть поставлено в соответствие число';
echo mb_strlen($string, 'UTF-8') . '<br />';
echo strlen($string);

Should get you the following output:

196
359

As a side note: this is one of the things that PHP 6 was expected to change. PHP 6 was planned to use Unicode by default, which would have meant that `strlen` returned the number of characters and no longer the number of bytes. That release never shipped.

edited Nov 17, 2023 at 19:42

answered Mar 5, 2010 at 5:27

Pascal MARTIN

4 Comments

Xorlev Over a year ago

Even with PHP5 that's not an assumption you can make. strlen() may or may not be overloaded by mb_strlen(). It's safer just to call mb_strlen($string, 'latin1');

Luke Over a year ago

The function I have provided in the question seems to work fine for utf-8. I believe the issue to my problem is somewhere else in the iPhone PUSH APNS code. I seem to be able to PUSH around 160 bytes of Japanese, English text etc. However I can only PUSH around 110 bytes of Cyrillic (Russian) characters.

Luke Over a year ago

I still believe that strlen and mb_strlen cannot be relied on to determine the actual bytes.

David Spector Over a year ago

PHP 6? PHP 6? It looks unlikely that PHP will ever "use Unicode by default".

coding Bott · Accepted Answer · 2010-03-05 13:18:38Z

0

Count of Bytes <> String length!

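To get the byte count you can use strlen (PHP 4 and 5). To get the length in characters of a UTF-8 encoded Unicode string you can use mb_strlen (take care with the function overloading that extension offers), or you can simply count all bytes that are not continuation bytes, i.e. whose top two bits are not 10.

A byte with the 8th bit set belongs to a multi-byte sequence: a lead byte (11xxxxxx) means at least one more byte follows for this Unicode character, and each of those following bytes has the form 10xxxxxx.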
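A sketch of counting characters by bytes alone, assuming PHP 7+ and well-formed UTF-8: every character contributes exactly one byte that is not a continuation byte (10xxxxxx), so counting those bytes gives the character count. The function name utf8CharCount is made up for illustration.

```php
<?php
// Hypothetical helper; assumes well-formed UTF-8 input.
function utf8CharCount(string $str): int
{
    $count = 0;
    $len = strlen($str); // number of bytes

    for ($i = 0; $i < $len; $i++) {
        // Skip continuation bytes (10xxxxxx); count ASCII and lead bytes.
        if ((ord($str[$i]) & 0xC0) !== 0x80) {
            $count++;
        }
    }

    return $count;
}

echo utf8CharCount('Привет'), ' characters in ', strlen('Привет'), " bytes\n";
```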

answered Mar 5, 2010 at 13:18

coding Bott
