26

When I use substr() I get a strange character at the end

$articleText = substr($articleText,0,500);

I have an output of 500 chars and � <--

How can I fix this? Is it an encoding problem? My language is Greek.

1
  • Have seen the same thing in (UK) English. Commented Aug 25, 2014 at 11:03

7 Answers

61

substr() counts bytes, not characters.

Greek text most likely means you are using a multi-byte encoding such as UTF-8, and cutting at a byte offset can split a character in the middle, which leaves an invalid trailing byte.

Using mb_substr() should help here: the mb_* functions were created specifically for multi-byte encodings.
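
To make the byte-versus-character difference concrete, here is a minimal sketch. The sample word and variable names are only illustrative, and it assumes the source file is saved as UTF-8 and the mbstring extension is enabled.

```php
<?php
// Greek sample text; each of these letters takes 2 bytes in UTF-8.
$text = 'Καλημέρα';

$byteLength = strlen($text);              // counts bytes
$charLength = mb_strlen($text, 'UTF-8');  // counts characters

// Cutting at an odd byte offset splits the second letter in half.
$byteCut = substr($text, 0, 3);
// Cutting by characters keeps every letter intact.
$charCut = mb_substr($text, 0, 3, 'UTF-8');

// mb_check_encoding() reports whether a string is valid in the given encoding.
$byteCutIsValid = mb_check_encoding($byteCut, 'UTF-8');
$charCutIsValid = mb_check_encoding($charCut, 'UTF-8');
```

The half-cut letter left at the end of $byteCut is what the browser renders as the replacement character.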


3 Comments

Learning more and more every single day... Thank you stackoverflow !
Thank you very much. But as for me the main thing is to add mb_internal_encoding("UTF-8"); before using mb_* functions. Without adding it I still see squares.
@Kremchik You won't see squares, if you use mb_substr($short, 0, 75, 'utf-8'). Then you don't need to use mb_internal_encoding before mb_substr.
20

Use mb_substr instead; it handles multi-byte encodings, not only single-byte strings as substr does:

$articleText = mb_substr($articleText,0,500,'UTF-8');

3 Comments

"UTF-8" part was important for me - don't forget it peeps!
"UTF-8" as optional parameter worked for me. Keep in mind that you might also want to use mb_strlen() if you are using the string length to determine if it must be cut.
An alternative is to use mb_internal_encoding('utf-8') before any mb_* command.
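
As a sketch of the mb_strlen() suggestion in the comments above: the helper name truncate_text and the '...' suffix are hypothetical, chosen only for illustration, and the code assumes UTF-8 input with mbstring enabled.

```php
<?php
// Hypothetical helper: cut $text to at most $limit characters and mark the cut.
function truncate_text(string $text, int $limit, string $encoding = 'UTF-8'): string
{
    // Compare character counts, not byte counts, before deciding to cut.
    if (mb_strlen($text, $encoding) <= $limit) {
        return $text;
    }
    return mb_substr($text, 0, $limit, $encoding) . '...';
}

$short = truncate_text('Γεια', 10);           // fits, returned unchanged
$long  = truncate_text('Γεια σου κόσμε', 4);  // cut after 4 characters
```

Checking the length with strlen() instead would treat a 300-character Greek text as roughly 600 units long and cut strings that do not need it.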
6

Looks like you're slicing a Unicode character in half there. Use mb_substr instead for Unicode-safe string slicing.

1 Comment

...either by calling mb_internal_encoding('utf-8') beforehand or by passing 'utf-8' as the fourth parameter of mb_substr. The docs say that the parameter is optional and that, when it is omitted, the internal character encoding is used; but the thing is (explained elsewhere in the PHP docs) that PHP's "internal encoding" is nearly always "something else" than your page encoding. So for slicing a UTF-8 string, either this fourth parameter or a call to mb_internal_encoding('utf-8') becomes required.
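
The two options described in this comment can be sketched side by side. The sample word is illustrative, and the code assumes a UTF-8 source file with mbstring enabled.

```php
<?php
$text = 'Ελληνικά';

// Option 1: pass the encoding explicitly on each call.
$explicit = mb_substr($text, 0, 3, 'UTF-8');

// Option 2: set the internal encoding once, then omit the argument.
mb_internal_encoding('UTF-8');
$implicit = mb_substr($text, 0, 3);

// Called without arguments, mb_internal_encoding() returns the current setting.
$current = mb_internal_encoding();
```

Passing the encoding explicitly keeps each call independent of global state, which may be the safer choice in code that other scripts include.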
1

Use this function; it worked for me:

function substr_unicode($str, $s, $l = null) {
    return join("", array_slice(
        preg_split("//u", $str, -1, PREG_SPLIT_NO_EMPTY), $s, $l));
}

Credits: http://php.net/manual/en/function.mb-substr.php#107698
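
To show what the function above relies on, here is a small sketch of the two steps it combines. The sample string and variable names are illustrative, and the source file is assumed to be UTF-8.

```php
<?php
// With the /u modifier, an empty pattern splits a UTF-8 string
// between characters rather than between bytes.
$chars = preg_split('//u', 'αβγδ', -1, PREG_SPLIT_NO_EMPTY);

// array_slice() then selects characters by index, and implode()
// (join() is an alias of it) glues them back into a string.
$slice = implode('', array_slice($chars, 1, 2));
```

This approach does not need the mbstring extension, at the cost of building an array of every character in the string.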


0

mb_substr() also works excellently for removing strange trailing line breaks, which I was having trouble with after parsing HTML code. The problem was NOT handled by:

 trim() 

or:

 var_dump(preg_match('/^\n|\n$/', $variable));

or:

str_replace (array('\r\n', '\n', '\r'), ' ', $text)

None of these caught it.
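
One possible reason the str_replace() attempt above did not match, offered as an assumption since the original input is not shown: in PHP, escape sequences such as \n are only interpreted inside double-quoted strings. A minimal sketch with an illustrative sample string:

```php
<?php
// Single quotes: '\n' is two literal characters, a backslash and an "n".
$singleLength = strlen('\n');
// Double quotes: "\n" is one line-feed character.
$doubleLength = strlen("\n");

$text = "line one\r\nline two\n";

// Single-quoted search strings never match real line breaks.
$unchanged = str_replace(array('\r\n', '\n', '\r'), ' ', $text);
// Double-quoted search strings do match them.
$cleaned = str_replace(array("\r\n", "\n", "\r"), ' ', $text);
```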


0

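Alternative solution for UTF-8 encoded strings: convert the string to a single-byte encoding before cutting the substring. Note that utf8_decode() converts UTF-8 to ISO-8859-1, so any character that does not exist in ISO-8859-1 (including Greek letters) is replaced with "?".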

$articleText = substr(utf8_decode($articleText),0,500);

To get the articleText string back to UTF-8, an extra operation will be needed:

$articleText = utf8_encode( substr(utf8_decode($articleText),0,500) );

1 Comment

This doesn't work at all.
0

You are trying to cut a Unicode character in half. Instead of substr(), try mb_substr() in PHP.

substr()

substr ( string $string , int $start [, int $length ] )

mb_substr()

mb_substr ( string $str , int $start [, int $length [, string $encoding ]] )

For more information about substr() - Credits => Check Here

