Optimizing a PHP function that trims strings

Question

I wrote this PHP function that takes any text/HTML string and truncates it.

For example:

gen_string("Hello, how are you today?",10);

Returns: Hello, how...

The problem arises when the length limit lands on the position of a special character such as á, ñ, etc.

In which case:

gen_string("Helló my friend",5);

Returns: Hell�...

Any ideas on how to solve this issue? This is the current function:

# string: advanced substr
function gen_string($string,$min,$clean=false) {
 $text = trim(strip_tags($string));
 if(strlen($text)>$min) {
  $blank = strpos($text,' ');
  if($blank) {
   # limit plus last word
   $extra = strpos(substr($text,$min),' ');
   $max = $min+$extra;
   $r = substr($text,0,$max);
   if(strlen($text)>=$max && !$clean) $r=trim($r,'.').'...';
  } else {
   # if there are no spaces
   $r = substr($text,0,$min).'...';
  }
 } else {
  # if original length is lower than limit
  $r = $text;
 }
 return trim($r);
}

Thanks!

You need to use the mbstring functions, especially mb_substr() (php.net/mb_substr) and mb_strpos() (php.net/mb_strpos).
– Frank Farmer, Commented Jul 8, 2010 at 18:00
Weird... I get "Call to undefined function mb_strimwidth()", and I do have PHP 5.
– Andres SK, Commented Jul 8, 2010 at 18:12

Mark Byers · Accepted Answer · 2010-07-08 17:59:46Z

4

You should use the multibyte string functions to handle Unicode characters correctly.

For example, you could try using mb_strimwidth to truncate a string to a specified width.
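
As a rough sketch of what that could look like (assuming UTF-8 input and that the mbstring extension is enabled; gen_string_mb() is a made-up name, not something from the question), the original logic can be kept while swapping the byte-based calls for their mb_ counterparts:

```php
<?php
// Sketch only. Assumes UTF-8 input and the mbstring extension.
// gen_string_mb() is a hypothetical name used for illustration.
function gen_string_mb($string, $min, $clean = false) {
    $enc  = 'UTF-8';
    $text = trim(strip_tags($string));

    // Compare character counts, not byte counts.
    if (mb_strlen($text, $enc) <= $min) {
        return $text;
    }

    // First space at or after the limit, so the word in progress stays whole.
    $space = mb_strpos($text, ' ', $min, $enc);
    // No space after the limit: fall back to a hard cut at $min characters.
    $cut = ($space === false) ? $min : $space;

    $r = mb_substr($text, 0, $cut, $enc);
    return $clean ? $r : rtrim($r, '.') . '...';
}

echo gen_string_mb("Helló my friend", 5), "\n";            // Helló...
echo gen_string_mb("Hello, how are you today?", 10), "\n"; // Hello, how...

// mb_strimwidth() counts the trim marker as part of the width: 5 + 3 = 8.
echo mb_strimwidth("Helló my friend", 0, 8, '...', 'UTF-8'), "\n"; // Helló...
```

Note that mb_strimwidth() measures display width rather than character count, so East Asian wide characters count as two columns; for plain Latin text the two are the same.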

2 Comments

Andres SK Over a year ago

Seems like a good option, but I got the "Call to undefined function mb_strimwidth()" error message, and I do have PHP 5.

Mark Byers Over a year ago

@andufo: You might want to see this related question regarding enabling the multibyte string functions: stackoverflow.com/questions/2294393/…

salathe · Accepted Answer · 2010-07-08 21:22:31Z

1

You could also take a different approach and make use of the PCRE regex extension's UTF-8 capabilities (assuming your strings are UTF-8!).

function gen_string($string, $length)
{
    $str    = trim(strip_tags($string));
    $strlen = strlen(utf8_decode($str));
    // String is less than limit
    if ($strlen <= $length) return $str;
    // Shorten string, preserving whole "words" (non-whitespace)
    preg_match('/^.{'.($length-1).'}\S*/su', $str, $match);
    // Append ellipsis if needed (bytes length is OK to check)
    if (strlen($match[0]) !== strlen($str)) $match[0] .= '...';
    return $match[0];
}
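
To illustrate what the u modifier changes in the pattern above, here is a small comparison using the sample string from the question. It is only a demonstration, assuming the source file and the string are UTF-8:

```php
<?php
// Illustration only; the sample string is taken from the question.
$str = "Helló my friend";

// With /u, "." matches one whole UTF-8 character.
preg_match('/^.{5}/su', $str, $chars);

// Without /u, "." matches one byte, so the two-byte "ó" gets split.
preg_match('/^.{5}/s', $str, $bytes);

echo $chars[0], "\n"; // Helló

// An empty pattern with /u is a cheap validity check:
// preg_match() returns false when the subject is not valid UTF-8.
var_dump(preg_match('//u', $chars[0])); // int(1)
var_dump(preg_match('//u', $bytes[0])); // bool(false)
```

The byte-mode match is exactly the broken "Hell" plus half a character that the question shows as �.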

gblazex · Accepted Answer · 2010-07-08 18:11:35Z

0

Aside from the multibyte issue, you could write it more concisely:

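function gen_string($str, $limit) {
    // Short enough already: return unchanged.
    if (strlen($str) <= $limit)
      return $str;
    // Negative offset: strrpos() searches backwards from position $limit.
    $offset = -(strlen($str) - $limit);
    return substr($str, 0, strrpos($str, ' ', $offset)).'...';
}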

This limits the length of the string: rather than cutting after the first word beyond the limit, it cuts at the last space at or before the limit, so the text before the ellipsis is never longer than the limit.
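
The same "never longer than the limit" idea could be made multibyte-safe along these lines. This is only a sketch: limit_words() is a hypothetical name, and it assumes UTF-8 input and the mbstring extension.

```php
<?php
// Sketch only. limit_words() is a hypothetical helper;
// assumes UTF-8 input and the mbstring extension.
function limit_words($str, $limit) {
    if (mb_strlen($str, 'UTF-8') <= $limit) {
        return $str;
    }
    // Keep one character past the limit so a space sitting
    // exactly at position $limit is still found.
    $head  = mb_substr($str, 0, $limit + 1, 'UTF-8');
    $space = mb_strrpos($head, ' ', 0, 'UTF-8');
    // No space in range: fall back to a hard cut at the limit.
    $cut = ($space === false) ? $limit : $space;
    return mb_substr($str, 0, $cut, 'UTF-8') . '...';
}

echo limit_words("Hello, how are you today?", 12), "\n"; // Hello, how...
echo limit_words("Helló my friend", 8), "\n";            // Helló my...
```

Searching a short prefix instead of passing a negative offset keeps the logic in character positions throughout, at the cost of one extra mb_substr() call.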

1 Comment

gblazex Over a year ago

Well, I don't really know what could be causing your problem, because I use the code above in production without any problems, and we use űúőó characters :) I also checked it just now and it works fine on my localhost. Are you sure everything is right in your server settings?

avpaderno · Accepted Answer · 2010-07-08 21:47:21Z

strlen() cannot be used to count the characters of a UTF-8 string, because it counts bytes, including the continuation bytes of multibyte characters, which should not be counted.
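
A quick way to see the difference, using the word from the question (a demonstration only, assuming the string is UTF-8):

```php
<?php
// "ó" is U+00F3, which UTF-8 encodes as two bytes: 0xC3 0xB3.
$s = "Helló";

// strlen() counts bytes.
$byteCount = strlen($s);

// Dropping continuation bytes (0x80-0xBF) leaves one byte per character.
$charCount = strlen(preg_replace('/[\x80-\xBF]/', '', $s));

echo $byteCount, "\n"; // 6
echo $charCount, "\n"; // 5

// A byte-based cut after 5 bytes keeps 0xC3 but drops 0xB3,
// leaving half a character at the end of the string.
echo bin2hex(substr($s, 0, 5)), "\n"; // 48656c6cc3
```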

You can try with the following code:

define('PREG_CLASS_UNICODE_WORD_BOUNDARY', 
  '\x{0}-\x{2F}\x{3A}-\x{40}\x{5B}-\x{60}\x{7B}-\x{A9}\x{AB}-\x{B1}\x{B4}' .
  '\x{B6}-\x{B8}\x{BB}\x{BF}\x{D7}\x{F7}\x{2C2}-\x{2C5}\x{2D2}-\x{2DF}' .
  '\x{2E5}-\x{2EB}\x{2ED}\x{2EF}-\x{2FF}\x{375}\x{37E}-\x{385}\x{387}\x{3F6}' .
  '\x{482}\x{55A}-\x{55F}\x{589}-\x{58A}\x{5BE}\x{5C0}\x{5C3}\x{5C6}' .
  '\x{5F3}-\x{60F}\x{61B}-\x{61F}\x{66A}-\x{66D}\x{6D4}\x{6DD}\x{6E9}' .
  '\x{6FD}-\x{6FE}\x{700}-\x{70F}\x{7F6}-\x{7F9}\x{830}-\x{83E}' .
  '\x{964}-\x{965}\x{970}\x{9F2}-\x{9F3}\x{9FA}-\x{9FB}\x{AF1}\x{B70}' .
  '\x{BF3}-\x{BFA}\x{C7F}\x{CF1}-\x{CF2}\x{D79}\x{DF4}\x{E3F}\x{E4F}' .
  '\x{E5A}-\x{E5B}\x{F01}-\x{F17}\x{F1A}-\x{F1F}\x{F34}\x{F36}\x{F38}' .
  '\x{F3A}-\x{F3D}\x{F85}\x{FBE}-\x{FC5}\x{FC7}-\x{FD8}\x{104A}-\x{104F}' .
  '\x{109E}-\x{109F}\x{10FB}\x{1360}-\x{1368}\x{1390}-\x{1399}\x{1400}' .
  '\x{166D}-\x{166E}\x{1680}\x{169B}-\x{169C}\x{16EB}-\x{16ED}' .
  '\x{1735}-\x{1736}\x{17B4}-\x{17B5}\x{17D4}-\x{17D6}\x{17D8}-\x{17DB}' .
  '\x{1800}-\x{180A}\x{180E}\x{1940}-\x{1945}\x{19DE}-\x{19FF}' .
  '\x{1A1E}-\x{1A1F}\x{1AA0}-\x{1AA6}\x{1AA8}-\x{1AAD}\x{1B5A}-\x{1B6A}' .
  '\x{1B74}-\x{1B7C}\x{1C3B}-\x{1C3F}\x{1C7E}-\x{1C7F}\x{1CD3}\x{1FBD}' .
  '\x{1FBF}-\x{1FC1}\x{1FCD}-\x{1FCF}\x{1FDD}-\x{1FDF}\x{1FED}-\x{1FEF}' .
  '\x{1FFD}-\x{206F}\x{207A}-\x{207E}\x{208A}-\x{208E}\x{20A0}-\x{20B8}' .
  '\x{2100}-\x{2101}\x{2103}-\x{2106}\x{2108}-\x{2109}\x{2114}' .
  '\x{2116}-\x{2118}\x{211E}-\x{2123}\x{2125}\x{2127}\x{2129}\x{212E}' .
  '\x{213A}-\x{213B}\x{2140}-\x{2144}\x{214A}-\x{214D}\x{214F}' .
  '\x{2190}-\x{244A}\x{249C}-\x{24E9}\x{2500}-\x{2775}\x{2794}-\x{2B59}' .
  '\x{2CE5}-\x{2CEA}\x{2CF9}-\x{2CFC}\x{2CFE}-\x{2CFF}\x{2E00}-\x{2E2E}' .
  '\x{2E30}-\x{3004}\x{3008}-\x{3020}\x{3030}\x{3036}-\x{3037}' .
  '\x{303D}-\x{303F}\x{309B}-\x{309C}\x{30A0}\x{30FB}\x{3190}-\x{3191}' .
  '\x{3196}-\x{319F}\x{31C0}-\x{31E3}\x{3200}-\x{321E}\x{322A}-\x{3250}' .
  '\x{3260}-\x{327F}\x{328A}-\x{32B0}\x{32C0}-\x{33FF}\x{4DC0}-\x{4DFF}' .
  '\x{A490}-\x{A4C6}\x{A4FE}-\x{A4FF}\x{A60D}-\x{A60F}\x{A673}\x{A67E}' .
  '\x{A6F2}-\x{A716}\x{A720}-\x{A721}\x{A789}-\x{A78A}\x{A828}-\x{A82B}' .
  '\x{A836}-\x{A839}\x{A874}-\x{A877}\x{A8CE}-\x{A8CF}\x{A8F8}-\x{A8FA}' .
  '\x{A92E}-\x{A92F}\x{A95F}\x{A9C1}-\x{A9CD}\x{A9DE}-\x{A9DF}' .
  '\x{AA5C}-\x{AA5F}\x{AA77}-\x{AA79}\x{AADE}-\x{AADF}\x{ABEB}' .
  '\x{D800}-\x{F8FF}\x{FB29}\x{FD3E}-\x{FD3F}\x{FDFC}-\x{FDFD}' .
  '\x{FE10}-\x{FE19}\x{FE30}-\x{FE6B}\x{FEFF}-\x{FF0F}\x{FF1A}-\x{FF20}' .
  '\x{FF3B}-\x{FF40}\x{FF5B}-\x{FF65}\x{FFE0}-\x{FFFD}');

function utf8_strlen($text) {
  if (function_exists('mb_strlen')) {
    return mb_strlen($text);
  }

  // Do not count UTF-8 continuation bytes.
  return strlen(preg_replace("/[\x80-\xBF]/", '', $text));
}

function utf8_truncate($string, $max_length, $wordsafe = FALSE, $add_ellipsis = FALSE, $min_wordsafe_length = 1) {
  $ellipsis = '';
  $max_length = max($max_length, 0);
  $min_wordsafe_length = max($min_wordsafe_length, 0);

  if (utf8_strlen($string) <= $max_length) {
    // No truncation needed, so don't add ellipsis, just return.
    return $string;
  }

  if ($add_ellipsis) {
    // Truncate ellipsis in case $max_length is small.
    $ellipsis = utf8_substr('...', 0, $max_length);
    $max_length -= utf8_strlen($ellipsis);
    $max_length = max($max_length, 0);
  }

  if ($max_length <= $min_wordsafe_length) {
    // Do not attempt word-safe if lengths are bad.
    $wordsafe = FALSE;
  }

  if ($wordsafe) {
    $matches = array();
    // Find the last word boundary, if there is one within $min_wordsafe_length
    // to $max_length characters. preg_match() is always greedy, so it will
    // find the longest string possible.
    $found = preg_match('/^(.{' . $min_wordsafe_length . ',' . $max_length . '})[' . PREG_CLASS_UNICODE_WORD_BOUNDARY . ']/u', $string, $matches);
    if ($found) {
      $string = $matches[1];
    }
    else {
      $string = utf8_substr($string, 0, $max_length);
    }
  }
  else {
    $string = utf8_substr($string, 0, $max_length);
  }

  if ($add_ellipsis) {
    $string .= $ellipsis;
  }

  return $string;
}

function utf8_substr($text, $start, $length = NULL) {
  if (function_exists('mb_substr')) {
    return $length === NULL ? mb_substr($text, $start) : mb_substr($text, $start, $length);
  }
  else {
    $strlen = strlen($text);
    // Find the starting byte offset.
    $bytes = 0;
    if ($start > 0) {
      // Count all the continuation bytes from the start until we have found
      // $start characters or the end of the string.
      $bytes = -1;
      $chars = -1;
      while ($bytes < $strlen - 1 && $chars < $start) {
        $bytes++;
        $c = ord($text[$bytes]);
        if ($c < 0x80 || $c >= 0xC0) {
          $chars++;
        }
      }
    }
    elseif ($start < 0) {
      // Count all the continuation bytes from the end until we have found
      // abs($start) characters.
      $start = abs($start);
      $bytes = $strlen;
      $chars = 0;
      while ($bytes > 0 && $chars < $start) {
        $bytes--;
        $c = ord($text[$bytes]);
        if ($c < 0x80 || $c >= 0xC0) {
          $chars++;
        }
      }
    }
    $istart = $bytes;

    // Find the ending byte offset.
    if ($length === NULL) {
      $iend = $strlen;
    }
    elseif ($length > 0) {
      // Count all the continuation bytes from the starting index until we have
      // found $length characters or reached the end of the string, then
      // backtrace one byte.
      $iend = $istart - 1;
      $chars = -1;
      $last_real = FALSE;
      while ($iend < $strlen - 1 && $chars < $length) {
        $iend++;
        $c = ord($text[$iend]);
        $last_real = FALSE;
        if ($c < 0x80 || $c >= 0xC0) {
          $chars++;
          $last_real = TRUE;
        }
      }
      // Backtrace one byte if the last character we found was a real character
      // and we don't need it.
      if ($last_real && $chars >= $length) {
        $iend--;
      }
    }
    elseif ($length < 0) {
      // Count all the continuation bytes from the end until we have found
      // abs($length) characters, then backtrace one byte.
      $length = abs($length);
      $iend = $strlen;
      $chars = 0;
      while ($iend > 0 && $chars < $length) {
        $iend--;
        $c = ord($text[$iend]);
        if ($c < 0x80 || $c >= 0xC0) {
          $chars++;
        }
      }
      // Backtrace one byte if we are not at the beginning of the string.
      if ($iend > 0) {
        $iend--;
      }
    }
    else {
      // $length == 0, return an empty string.
      return '';
    }

    return substr($text, $istart, max(0, $iend - $istart + 1));
  }
}

greenie · Accepted Answer · 2010-07-08 18:09:31Z

-1

For your return statement you could try:

return htmlspecialchars(trim($r));

EDIT: I tried your code as provided and it ran fine for me without having to use htmlspecialchars(). This is probably due to the fact that, in the <head> of the page the code was running on, the charset was set to UTF-8. So your options could be to set the encoding of the page like this:

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

or to use htmlspecialchars() as above.

2 Comments

user3850 Over a year ago

The � comes from half a Unicode character, and htmlspecialchars() won't help there.

greenie Over a year ago

I'm stumped then. Your original code worked absolutely fine for me on my localhost and in a production environment too. I can only take a guess at looking for things that might be causing the problem outside of what you originally posted.
