PHP mixed UTF-8 encoding reading CSV

Question

I thought this would be easy to find with Google, but it turns out not to be. Here is my problem:

I have to read a CSV file that contains a mix of encodings. I can't correct the file beforehand, so I have to handle it in my application. For example, the file can contain both of the following:

'Ü5' and 'MÃ¶belmarkt' in the same file.

If I decode with utf8_decode, the right one becomes correct and the left one (which was correct) becomes wrong. When I try to determine the encoding with mb_detect_encoding, I always get UTF-8.

I have already tried the following solutions:

public function convert( $str ) {
    return iconv( "Windows-1252", "UTF-8", $str );
}

and

private function getUmlauteArray() { 
    return array( 'Ã¼'=>'ü', 'Ã¤'=>'ä', 'Ã¶'=>'ö', 'Ã–'=>'Ö', 'ÃŸ'=>'ß', 'Ã '=>'à', 'Ã¡'=>'á', 'Ã¢'=>'â', 'Ã£'=>'ã', 'Ã¹'=>'ù', 'Ãº'=>'ú', 'Ã»'=>'û', 'Ã™'=>'Ù', 'Ãš'=>'Ú', 'Ã›'=>'Û', 'Ãœ'=>'Ü', 'Ã²'=>'ò', 'Ã³'=>'ó', 'Ã´'=>'ô', 'Ã¨'=>'è', 'Ã©'=>'é', 'Ãª'=>'ê', 'Ã«'=>'ë', 'Ã€'=>'À', 'Ã'=>'Á', 'Ã‚'=>'Â', 'Ãƒ'=>'Ã', 'Ã„'=>'Ä', 'Ã…'=>'Å', 'Ã‡'=>'Ç', 'Ãˆ'=>'È', 'Ã‰'=>'É', 'ÃŠ'=>'Ê', 'Ã‹'=>'Ë', 'ÃŒ'=>'Ì', 'Ã'=>'Í', 'ÃŽ'=>'Î', 'Ã'=>'Ï', 'Ã‘'=>'Ñ', 'Ã’'=>'Ò', 'Ã“'=>'Ó', 'Ã”'=>'Ô', 'Ã•'=>'Õ', 'Ã˜'=>'Ø', 'Ã¥'=>'å', 'Ã¦'=>'æ', 'Ã§'=>'ç', 'Ã¬'=>'ì', 'Ã'=>'í', 'Ã®'=>'î', 'Ã¯'=>'ï', 'Ã°'=>'ð', 'Ã±'=>'ñ', 'Ãµ'=>'õ', 'Ã¸'=>'ø', 'Ã½'=>'ý', 'Ã¿'=>'ÿ', 'â‚¬'=>'€' );
}

public function fixeUmlaute($string) {                  
    $umlaute = $this->getUmlauteArray();
    foreach ($umlaute as $key => $value){
        // Assign back to $string; assigning to $value left the input unchanged.
        $string = str_replace($key, $value, $string);
    } 
    return $string;
}

and

function valid_utf8( $string ){
    return !((bool)preg_match('~[\xF5\xF6\xF7\xF8\xF9\xFA\xFB\xFC\xFD\xFE\xFF\xC0\xC1]~ms',$string));
}

Those are all the solutions I found with Google for changing the encoding (perhaps this collection helps somebody else). So how can I reliably detect the wrongly encoded characters, or where is my mistake?

Can anybody give me a hint?

Greetz

V

When using mb_detect_encoding(): (1) give it the list of candidate character sets, because without that list the function is next to useless; (2) require strict detection. In other words, use the 2nd and 3rd arguments of the function. Also pick the CSV apart first: I get that the same line can hold different character sets, but I doubt the encoding changes within one field, so use fgetcsv() and 'fix' the entries individually.
– Wrikken, Commented Oct 12, 2013 at 12:52
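
A minimal sketch of the per-field idea from this comment, under two assumptions that are not from the thread: every field is either valid UTF-8 or Windows-1252, and the file uses `;` as its delimiter. The helper names `fieldToUtf8()` and `readCsvAsUtf8()` are made up for illustration. Note that it swaps `mb_detect_encoding()` for `mb_check_encoding()`, which only answers the yes/no question "is this valid UTF-8?" and so involves no guessing.

```php
<?php
// Hypothetical helper: return one CSV field as UTF-8.
// Assumption: anything that is not valid UTF-8 is Windows-1252.
function fieldToUtf8(string $field): string
{
    if (mb_check_encoding($field, 'UTF-8')) {
        return $field; // already valid UTF-8 (this includes plain ASCII)
    }
    return mb_convert_encoding($field, 'UTF-8', 'Windows-1252');
}

// Hypothetical helper: read a CSV file and normalise every field.
// Assumption: ';' is the delimiter; pass another one if needed.
function readCsvAsUtf8(string $path, string $delimiter = ';'): array
{
    $handle = fopen($path, 'r');
    if ($handle === false) {
        throw new RuntimeException("Cannot open $path");
    }
    $rows = [];
    while (($row = fgetcsv($handle, 0, $delimiter, '"', '\\')) !== false) {
        if ($row === [null]) {
            continue; // fgetcsv() reports a blank line as [null]
        }
        $rows[] = array_map('fieldToUtf8', $row);
    }
    fclose($handle);
    return $rows;
}
```

This does not repair text that was encoded to UTF-8 twice; such a field is still valid UTF-8, so the check above passes it through unchanged.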

Davy Baert · Accepted Answer · 2013-10-12 13:56:43Z

1

There is a nice PHP class, forceutf8 (https://github.com/neitanod/forceutf8), that can help you with that. It converts strings that mix Latin-1, Windows-1252 and UTF-8 into UTF-8 and handles the detection for you. Hope it helps.
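
For reference, a rough sketch of how the class might be called. It assumes the library was installed with Composer (package `neitanod/forceutf8`) and that `vendor/autoload.php` sits next to the script; the wrapper name `normaliseWithForceUtf8()` is made up for illustration. `Encoding::toUTF8()` is the library's conversion method, and the library also offers `Encoding::fixUTF8()` for text that was encoded to UTF-8 twice.

```php
<?php
// Assumption: installed via Composer; adjust the autoload path to your project.
$autoload = __DIR__ . '/vendor/autoload.php';
if (is_file($autoload)) {
    require_once $autoload;
}

// Hypothetical wrapper around the library call, applied to one CSV field.
function normaliseWithForceUtf8(string $value): string
{
    if (!class_exists('ForceUTF8\\Encoding')) {
        throw new RuntimeException('neitanod/forceutf8 is not installed');
    }
    // Leaves valid UTF-8 alone and converts Latin-1 / Windows-1252 bytes.
    return \ForceUTF8\Encoding::toUTF8($value);
}
```

Applied per field, for example with `array_map('normaliseWithForceUtf8', $row)` on each row returned by `fgetcsv()`, this should handle a file that mixes both kinds of fields.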

1 Comment

Der_V Over a year ago

Thanks, Davy Baert and Wrikken. I had never seen an example of mb_detect_encoding() with an encoding list, but I can imagine it would work that way too. I personally tried the Encoding class that Davy Baert suggested, and it works very well. It's easy, and it's exactly what I was looking for. Thanks for the help!
