Encoding an array with UTF8 strings before json_encode trouble

Question

I'm trying to process an array of tweets using array_walk to encode the text into UTF-8 so that any Chinese characters are handled properly.

array_walk($tweet_data, function(&$tweet, $key) {
    $tweet['text'] = iconv('Windows-1250', 'UTF-8', $tweet['text']);
});

When I do this, I get the error "Detected an illegal character in input string".

I've also tried this using utf8_encode.

array_walk($tweet_data, function(&$tweet, $key) {
    $tweet['text'] = utf8_encode($tweet['text']);
});

And this passes through without any issue, but when the text is then displayed on the page, the characters are all wrong.

How can I properly handle UTF8 characters before passing into json_encode so it doesn't break?

Unfortunately, that doesn't help. I've tried that already and json_encode just returns an empty result.
– Chris R., commented Mar 26, 2015 at 15:25

Ghostman · Accepted Answer · 2015-03-26 15:24:48Z

3

This simple PHP function recursively converts all values of an array to UTF-8. The call to mb_detect_encoding (line 4) checks whether the value is already valid UTF-8, so that it is not converted a second time.

function utf8_converter($array)
{
    array_walk_recursive($array, function (&$item, $key) {
        if (is_string($item) && !mb_detect_encoding($item, 'UTF-8', true)) {
            // utf8_encode() assumes the input is ISO-8859-1.
            $item = utf8_encode($item);
        }
    });

    return $array;
}
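As a rough illustration of why the guard matters, the sketch below shows what happens when text that is already UTF-8 is treated as ISO-8859-1 and converted again. It uses mb_convert_encoding instead of utf8_encode (which is deprecated as of PHP 8.2) and mb_check_encoding as the validity test. The helper name to_utf8_once and the sample string are made up for this example, and the ISO-8859-1 source encoding is an assumption, not something the question confirms.

```php
<?php
// Hypothetical helper: convert to UTF-8 only when the input is not
// already valid UTF-8. Assumes any non-UTF-8 input is ISO-8859-1.
function to_utf8_once(string $text): string
{
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text; // already valid, leave untouched
    }
    return mb_convert_encoding($text, 'UTF-8', 'ISO-8859-1');
}

// Sample text that is already UTF-8 (two Chinese characters plus ASCII).
$original = "\u{4E2D}\u{6587} tweet";

// Treating UTF-8 bytes as ISO-8859-1 re-encodes every byte, which is
// the kind of garbling described in the question.
$garbled = mb_convert_encoding($original, 'UTF-8', 'ISO-8859-1');

var_dump($garbled === $original);                // bool(false)
var_dump(to_utf8_once($original) === $original); // bool(true)
```

If the guard leaves every string untouched, the input was valid UTF-8 all along and the conversion step can be dropped.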

1 Comment

Chris R. Over a year ago

This is almost exactly what I'm doing with the array_walk. I've added in the if statement, and it looks like my strings are already utf8 anyway, so this just garbles it.

Álvaro González · Accepted Answer · 2015-03-26 15:30:45Z

1

Windows-1250 cannot encode Chinese:

Windows-1250 is a code page used under Microsoft Windows to represent texts in Central European and Eastern European languages that use Latin script, such as Polish, Czech, Slovak, Hungarian, Slovene, Bosnian, Croatian, Serbian (Latin script), Romanian (before 1993 spelling reform) and Albanian. It may also be used with the German language.

Neither can ISO-8859-1:

[ISO-8859-1] is generally intended for Western European languages (see below for a list).

I think you are trying to convert from A to B and you don't know what A is. If you're fully sure it isn't UTF-8 already, you should at least try an encoding that's specifically designed to hold that language.
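One way to find out what "A" is before converting anything is to check each string for UTF-8 validity and to ask json_encode why it failed. The following is only a sketch: the sample $tweet_data array is invented to mirror the shape used in the question, and whether malformed bytes are the actual cause of the empty result there is an assumption.

```php
<?php
// Sample data standing in for $tweet_data from the question (made up).
$tweet_data = [
    ['text' => "\u{4F60}\u{597D} world"], // valid UTF-8
    ['text' => "caf\xE9"],                // invalid UTF-8: lone 0xE9 byte
];

// Report which entries are not valid UTF-8 instead of converting blindly.
$invalid = [];
foreach ($tweet_data as $index => $tweet) {
    if (!mb_check_encoding($tweet['text'], 'UTF-8')) {
        $invalid[] = $index;
    }
}

// json_encode() returns false on malformed UTF-8; ask it why.
$json = json_encode($tweet_data);
if ($json === false) {
    echo 'json_encode failed: ', json_last_error_msg(), PHP_EOL;
}

// With only the valid entry, encoding succeeds. JSON_UNESCAPED_UNICODE
// keeps the characters literal instead of \uXXXX escapes.
$ok = json_encode([$tweet_data[0]], JSON_UNESCAPED_UNICODE);
echo $ok, PHP_EOL;
```

Strings that pass the check need no conversion at all. Only the ones listed in $invalid need their real source encoding identified.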

1 Comment

Chris R. Over a year ago

I've done mb_detect_encoding on the strings and they're already in UTF-8 before passing to json_encode, like they should be.
