I have the following function from the php.net site to determine the number of bytes in an ASCII or UTF-8 string:

<?php 
/** 
 * Count the number of bytes of a given string. 
 * Input string is expected to be ASCII or UTF-8 encoded. 
 * Warning: the function doesn't return the number of chars 
 * in the string, but the number of bytes. 
 * 
 * @param string $str The string to compute number of bytes 
 * 
 * @return The length in bytes of the given string. 
 */ 
function strBytes($str) 
{ 
  // STRINGS ARE EXPECTED TO BE IN ASCII OR UTF-8 FORMAT 

  // Number of characters in string 
  $strlen_var = strlen($str); 

  // string bytes counter 
  $d = 0; 

 /* 
  * Iterate over every character in the string, 
  * escaping with a slash or encoding to UTF-8 where necessary 
  */ 
  for ($c = 0; $c < $strlen_var; ++$c) { 

      $ord_var_c = ord($str{$d}); 

      switch (true) { 
          case (($ord_var_c >= 0x20) && ($ord_var_c <= 0x7F)): 
              // characters U-00000000 - U-0000007F (same as ASCII) 
              $d++; 
              break; 

          case (($ord_var_c & 0xE0) == 0xC0): 
              // characters U-00000080 - U-000007FF, mask 110XXXXX 
              // see http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8 
              $d+=2; 
              break; 

          case (($ord_var_c & 0xF0) == 0xE0): 
              // characters U-00000800 - U-0000FFFF, mask 1110XXXX 
              // see http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8 
              $d+=3; 
              break; 

          case (($ord_var_c & 0xF8) == 0xF0): 
              // characters U-00010000 - U-001FFFFF, mask 11110XXX 
              // see http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8 
              $d+=4; 
              break; 

          case (($ord_var_c & 0xFC) == 0xF8): 
              // characters U-00200000 - U-03FFFFFF, mask 111110XX 
              // see http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8 
              $d+=5; 
              break; 

          case (($ord_var_c & 0xFE) == 0xFC): 
              // characters U-04000000 - U-7FFFFFFF, mask 1111110X 
              // see http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8 
              $d+=6; 
              break; 
          default: 
            $d++;    
      } 
  } 

  return $d; 
} 
?> 

However, when I try this with Russian text (e.g. По своей природе компьютеры могут работать лишь с числами. И для того, чтобы они могли хранить в памяти буквы или другие символы, каждому такому символу должно быть поставлено в соответствие число.), it doesn't seem to return the correct number of bytes.

The switch statement is hitting the default case. Any ideas why Russian characters would not be working as expected? Or are there better options for this?

I am asking this as I need to shorten a UTF-8 string to a certain number of bytes. i.e. I can only send a max. of 169 bytes of JSON data to the iPhone APNS in my situation (excluding the other packet data).

Reference: PHP strlen - Manual (Paolo Comment on 10-Jan-2007 03:58)

7
  • 5
    switch(true) ? That's an odd way to do things.. Commented Mar 5, 2010 at 2:56
  • The function is from the comment in the reference at the bottom of the post. I didn't code it :) However it looks like it is along the right process rather than using mb_strlen, apart from the Russian characters not working. Commented Mar 5, 2010 at 4:31
  • @Brendan I was just thinking the same thing. Commented Mar 5, 2010 at 5:32
  • 1
    @BrendanLong What is odd about switch(true)? Commented Oct 20, 2012 at 14:03
  • 1
    It is? Imho if you have multiple elseifs you should use a switch() when possible as in OP. Maybe it is just me. :) Commented Oct 20, 2012 at 16:14

5 Answers 5

5

I am asking this as I need to shorten a utf-8 string to a certain number of bytes.

mb_strcut() does exactly this, though you might not be able to tell from the barely comprehensible documentation.
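
A minimal sketch of how this might look, assuming the mbstring extension is loaded and the input is valid UTF-8. The 169-byte limit comes from the question; the sample string and variable names are only for illustration:

```php
<?php
// Sample text for illustration: 6 Cyrillic letters, 2 bytes each in UTF-8.
$text = 'Привет';

// mb_strcut() takes its start offset and length in BYTES, and moves the cut
// back to a character boundary if it would land inside a multibyte sequence.
$short = mb_strcut($text, 0, 5, 'UTF-8');

echo $short, "\n";          // Пр
echo strlen($short), "\n";  // 4 (bytes), not more than the 5 requested

// With the limit mentioned in the question:
$payloadText = mb_strcut($text, 0, 169, 'UTF-8');
```

Note that this only bounds the raw string; any JSON escaping applied afterwards can still push the payload past the limit.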

1 Comment

Thank you, using mb_strcut() is better than mb_substr() for my situation.
2

strlen() returns the number of bytes.

Shortening a multibyte string to a certain number of bytes is a separate task. You will need to take care not to chop the string off in the middle of a multibyte sequence as you shorten it.
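
As a rough illustration of not chopping in the middle of a sequence, here is a sketch that does not need mbstring. It assumes PHP 7 or later, valid UTF-8 input, and that strlen() is not overloaded; the function name truncateUtf8Bytes is made up for this example:

```php
<?php
// Hypothetical helper: cut $s to at most $maxBytes bytes without splitting
// a UTF-8 character. Assumes $s is valid UTF-8.
function truncateUtf8Bytes(string $s, int $maxBytes): string
{
    if ($maxBytes <= 0) {
        return '';
    }
    if (strlen($s) <= $maxBytes) {
        return $s; // already short enough
    }

    $cut = $maxBytes;
    // Continuation bytes look like 10xxxxxx. If the first dropped byte is
    // one of those, the cut is inside a character, so back up to its lead byte.
    while ($cut > 0 && (ord($s[$cut]) & 0xC0) === 0x80) {
        $cut--;
    }

    return substr($s, 0, $cut);
}

echo truncateUtf8Bytes('Привет', 5), "\n"; // Пр
```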

The other thing you need to handle is that a string may need more bytes once it is put into JSON notation. For example, if your string contains a double quote character, it needs to be escaped, and the backslash adds one byte. There are other characters that need to be escaped too. The point is, it can get larger. I assume the byte limit is on the total JSON payload, so you do need to account for the JSON syntax itself, as well as any escaping that JSON will impose on your string.

An unoptimized, kinda hacky way to do it is to chop the string, at say 5 bytes more than your limit, using substr(). Now use mb_strlen() to get number of characters, and mb_substr() to remove the last character. Now encode it as json, and measure the bytes via strlen(). Enter a loop, which keeps chopping off the last character using mb_substr(), encodes as json, and again measure bytes using strlen(). The loop terminates when the number of bytes is acceptable.
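
The loop could be sketched roughly like this, assuming PHP 7 or later, the mbstring extension, valid UTF-8 input, and a non-overloaded strlen(). The function name truncateForJson and the $flags parameter are made up for this example, and it measures only the encoded string, not a full APNS payload:

```php
<?php
// Hypothetical helper: drop trailing characters until the JSON-encoded
// string (including its surrounding quotes) fits in $maxBytes bytes.
function truncateForJson(string $text, int $maxBytes, int $flags = 0): string
{
    while ($text !== '' && strlen(json_encode($text, $flags)) > $maxBytes) {
        // Remove one whole character, not one byte.
        $text = mb_substr($text, 0, mb_strlen($text, 'UTF-8') - 1, 'UTF-8');
    }

    return $text;
}

// The escaped double quote costs an extra byte in the encoded form.
echo truncateForJson('a"b', 4), "\n"; // a

// By default json_encode() writes non-ASCII characters as \uXXXX escapes,
// so each Cyrillic letter costs 6 bytes instead of 2.
echo truncateForJson('Привет', 14), "\n";                         // Пр
echo truncateForJson('Привет', 14, JSON_UNESCAPED_UNICODE), "\n"; // Привет
```

The \uXXXX escaping is worth checking when non-Latin text seems to hit the limit much earlier than expected; JSON_UNESCAPED_UNICODE is available from PHP 5.4.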

2 Comments

I already have a while loop that keeps chopping 1 character at a time using mb_substr until the byte count falls below the limit. strlen doesn't seem to return the same # of bytes as the function in my question. strlen() may or may not be overloaded by mb_strlen() as per other comments, so it shouldn't be relied on.
So don't overload strlen. If you don't control it, then there are other ways. E.g. while (isset($str[$i])) $i++; will do the trick. Or fwrite() it to a stream or something...
2

If you wish to find the byte length of a multi-byte string when you are using mbstring.func_overload 2 and UTF-8 strings, then you can use the following:

mb_strlen($utf8_string, 'latin1');
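
A quick way to check this, assuming the mbstring extension is loaded; the sample string is only for illustration. Treating the input as a single-byte encoding makes every byte count as one character:

```php
<?php
// Sample text for illustration: 6 Cyrillic letters, 2 bytes each in UTF-8.
$utf8_string = 'Привет';

$chars = mb_strlen($utf8_string, 'UTF-8');   // counts characters
$bytes = mb_strlen($utf8_string, 'latin1');  // counts bytes
$raw   = mb_strlen($utf8_string, '8bit');    // mbstring's raw-byte encoding

echo $chars, "\n"; // 6
echo $bytes, "\n"; // 12
echo $raw, "\n";   // 12
```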

2 Comments

Doesn't this just give the string length in the # of characters? I need to know the actual number of bytes that is being used. Within utf-8 a character can be more than one byte, correct?
According to the comments section of php.net/manual/en/function.mb-strlen.php (very bottom), it's widely agreed that this function, called in the way described, will count the BYTES. When you tell the function that your input string contains latin1 (ergo: single-byte) chars, it counts every byte as a character, even if that byte is not a valid character in the ASCII sense. Could you try this out? Sorry, I don't have an mb-enabled environment...
1

In PHP 5, mb_strlen should return the number of characters, and strlen should return the number of bytes.

For instance, this portion of code:

$string = 'По своей природе компьютеры могут работать лишь с числами. И для того, чтобы они могли хранить в памяти буквы или другие символы, каждому такому символу должно быть поставлено в соответствие число';
echo mb_strlen($string, 'UTF-8') . '<br />';
echo strlen($string);

Should get you the following output:

196
359

As a sidenote: this is one of the things that PHP 6 will change. PHP 6 will be using Unicode by default, which means `strlen` should, in PHP 6, return the number of characters, and no longer the number of bytes.

4 Comments

Even with PHP5 that's not an assumption you can make. strlen() may or may not be overloaded by mb_strlen(). It's safer just to call mb_strlen($string, 'latin1');
The function I have provided in the question seems to work fine for utf-8. I believe the issue to my problem is somewhere else in the iPhone PUSH APNS code. I seem to be able to PUSH around 160 bytes of Japanese, English text etc. However I can only PUSH around 110 bytes of Cyrillic (Russian) characters.
I still believe that strlen and mb_strlen cannot be relied on to determine the actual bytes.
PHP 6? PHP 6? It looks unlikely that PHP will ever "use Unicode by default".
0

Count of Bytes <> String length!

To get the byte count you can use strlen (PHP 4, 5). To get the length of a Unicode (UTF-8 encoded) string you can use mb_strlen (watch out for function overloading from that extension), or you can simply count all bytes that are not continuation bytes, i.e. whose top two bits are not 10.

A set 8th bit means the byte is part of a multi-byte sequence: a lead byte (11xxxxxx) announces that at least one more byte follows, and each following byte has the form 10xxxxxx.
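
A minimal sketch of the counting approach, assuming PHP 7 or later, valid UTF-8 input, and a non-overloaded strlen(). Every character contributes exactly one byte that is not a continuation byte (10xxxxxx), so counting those bytes gives the character count. The function name utf8CharCount is made up for this example:

```php
<?php
// Hypothetical helper: count UTF-8 characters without mbstring by skipping
// continuation bytes (bit pattern 10xxxxxx).
function utf8CharCount(string $bytes): int
{
    $count = 0;
    $length = strlen($bytes); // number of bytes

    for ($i = 0; $i < $length; $i++) {
        if ((ord($bytes[$i]) & 0xC0) !== 0x80) {
            $count++; // ASCII byte or lead byte of a multibyte sequence
        }
    }

    return $count;
}

echo strlen('Привет'), "\n";        // 12 bytes
echo utf8CharCount('Привет'), "\n"; // 6 characters
```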
