I'm trying to process an array of tweets using array_walk to encode the text as UTF-8, so that any Chinese characters are handled properly.

array_walk($tweet_data, function(&$tweet, $key) {
    $tweet['text'] = iconv('Windows-1250', 'UTF-8', $tweet['text']);
});

When I do this, I get the error "Detected an illegal character in input string"

I've also tried this using utf8_encode.

array_walk($tweet_data, function(&$tweet, $key) {
    $tweet['text'] = utf8_encode($tweet['text']);
});

And this passes through without any issue, but when the text is then displayed on the page, the characters are all wrong.

How can I properly handle UTF-8 characters before passing the array into json_encode, so that it doesn't break?

  • json_encode($array, JSON_UNESCAPED_UNICODE) Commented Mar 26, 2015 at 15:20
  • Unfortunately, that doesn't help. I've tried that already and json_encode just returns an empty result. Commented Mar 26, 2015 at 15:25

2 Answers

This simple PHP function recursively converts all values of an array to UTF-8. The mb_detect_encoding call checks whether a value is already valid UTF-8, so that it is not converted a second time. Note that utf8_encode assumes its input is ISO-8859-1.

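function utf8_converter($array)
{
    array_walk_recursive($array, function (&$item, $key) {
        // Convert only strings that fail the strict UTF-8 check.
        // utf8_encode treats its input as ISO-8859-1.
        if (is_string($item) && !mb_detect_encoding($item, 'UTF-8', true)) {
            $item = utf8_encode($item);
        }
    });

    return $array;
}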
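
As a rough sketch of the same idea without utf8_encode (which is deprecated as of PHP 8.2), the variant below uses mb_check_encoding and mb_convert_encoding from the mbstring extension. The function name utf8_converter_checked, the $assumedSource parameter and the sample data are hypothetical, and ISO-8859-1 is only an assumed source encoding; if the real source encoding is different, the converted text will still come out wrong.

```php
<?php
// Hypothetical variant of the function above; the name and the assumed
// source encoding (ISO-8859-1) are illustrative, not from the original answer.
function utf8_converter_checked(array $array, string $assumedSource = 'ISO-8859-1'): array
{
    array_walk_recursive($array, function (&$item) use ($assumedSource) {
        // Only strings can be mis-encoded; ints, bools and null are left alone.
        // Strings that are already valid UTF-8 are not touched.
        if (is_string($item) && !mb_check_encoding($item, 'UTF-8')) {
            $item = mb_convert_encoding($item, 'UTF-8', $assumedSource);
        }
    });

    return $array;
}

// Made-up sample data for illustration.
$data = [
    'ok'     => "\xE4\xB8\xAD\xE6\x96\x87", // Chinese text already encoded as UTF-8
    'latin1' => "caf\xE9",                   // "cafe" with an accented e, in ISO-8859-1
];
$fixed = utf8_converter_checked($data);
```

The valid UTF-8 value passes through unchanged, which is the behaviour that prevents double encoding.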

1 Comment

This is almost exactly what I'm doing with the array_walk. I've added the if statement, and it looks like my strings are already UTF-8 anyway, so this just garbles them.

Windows-1250 cannot encode Chinese:

Windows-1250 is a code page used under Microsoft Windows to represent texts in Central European and Eastern European languages that use Latin script, such as Polish, Czech, Slovak, Hungarian, Slovene, Bosnian, Croatian, Serbian (Latin script), Romanian (before 1993 spelling reform) and Albanian. It may also be used with the German language

Neither can ISO-8859-1:

[ISO-8859-1] is generally intended for Western European languages.

I think you are trying to convert from A to B without knowing what A is. If you're fully sure it isn't UTF-8 already, you should at least try an encoding that's specifically designed to hold that language.
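
Before converting anything, it may help to check whether the input is valid UTF-8 at all and to ask json_encode why it failed. The following is a minimal sketch under stated assumptions: find_invalid_utf8 is a hypothetical helper name, the sample tweets are made up, and the mbstring extension is assumed to be loaded.

```php
<?php
// Hypothetical helper: collect hex dumps of every string that is not valid UTF-8.
function find_invalid_utf8(array $data): array
{
    $bad = [];
    array_walk_recursive($data, function ($item) use (&$bad) {
        if (is_string($item) && !mb_check_encoding($item, 'UTF-8')) {
            $bad[] = bin2hex($item); // hex makes the offending bytes visible
        }
    });

    return $bad;
}

// Made-up sample data for illustration.
$tweets = [
    ['text' => "\xE4\xBD\xA0\xE5\xA5\xBD"], // valid UTF-8 (Chinese text)
    ['text' => "bad \xB1\x31 bytes"],       // stray 0xB1 byte, not valid UTF-8
];

$invalid = find_invalid_utf8($tweets);

// json_encode returns false when it meets malformed UTF-8;
// json_last_error() and json_last_error_msg() report the reason.
$json      = json_encode($tweets, JSON_UNESCAPED_UNICODE);
$errorCode = json_last_error();
$errorMsg  = json_last_error_msg();
```

If $invalid is empty and json_encode still fails, the cause is probably something other than the encoding of the strings, and $errorMsg should say what.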

1 Comment

I've done mb_detect_encoding on the strings and they're already in UTF-8 before passing to json_encode, like they should be.
