
I have a problem that I thought would be easy to Google, but apparently it is not. Here it is:

I have to read a CSV file whose encoding is broken and inconsistent. I can't correct the CSV file beforehand, so I have to handle it in my application. The file can contain values like the following:

'Ü5' and 'MÃ¶belmarkt' in the same file.

If I decode with utf8_decode, the right one comes out correct and the left one (which was correct before) becomes wrong. When I try to detect the encoding with mb_detect_encoding, I always get UTF-8 as the answer.

So far I have tried the following solutions:

public function convert( $str ) {
    return iconv( "Windows-1252", "UTF-8", $str );
}

and

private function getUmlauteArray() { 
    return array( 'ü'=>'ü', 'ä'=>'ä', 'ö'=>'ö', 'Ö'=>'Ö', 'ß'=>'ß', 'à '=>'à', 'á'=>'á', 'â'=>'â', 'ã'=>'ã', 'ù'=>'ù', 'ú'=>'ú', 'û'=>'û', 'Ù'=>'Ù', 'Ú'=>'Ú', 'Û'=>'Û', 'Ü'=>'Ü', 'ò'=>'ò', 'ó'=>'ó', 'ô'=>'ô', 'è'=>'è', 'é'=>'é', 'ê'=>'ê', 'ë'=>'ë', 'À'=>'À', 'Ã'=>'Á', 'Â'=>'Â', 'Ã'=>'Ã', 'Ä'=>'Ä', 'Ã…'=>'Å', 'Ç'=>'Ç', 'È'=>'È', 'É'=>'É', 'Ê'=>'Ê', 'Ë'=>'Ë', 'ÃŒ'=>'Ì', 'Ã'=>'Í', 'ÃŽ'=>'Î', 'Ã'=>'Ï', 'Ñ'=>'Ñ', 'Ã’'=>'Ò', 'Ó'=>'Ó', 'Ô'=>'Ô', 'Õ'=>'Õ', 'Ø'=>'Ø', 'Ã¥'=>'å', 'æ'=>'æ', 'ç'=>'ç', 'ì'=>'ì', 'í'=>'í', 'î'=>'î', 'ï'=>'ï', 'ð'=>'ð', 'ñ'=>'ñ', 'õ'=>'õ', 'ø'=>'ø', 'ý'=>'ý', 'ÿ'=>'ÿ', '€'=>'€' );
}

public function fixeUmlaute($string) {
    $umlaute = $this->getUmlauteArray();
    foreach ($umlaute as $key => $value){
        // Assign back to $string; assigning to $value discarded every replacement.
        $string = str_replace($key, $value, $string);
    }
    return $string;
}

and

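// Rejects only bytes that can never occur in UTF-8 (0xC0, 0xC1, 0xF5 to 0xFF);
// passing this check does not prove that the string is valid UTF-8.
function valid_utf8( $string ){
    return !((bool)preg_match('~[\xF5\xF6\xF7\xF8\xF9\xFA\xFB\xFC\xFD\xFE\xFF\xC0\xC1]~ms',$string));
}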

Those are all the solutions I found with a Google search for changing the encoding (perhaps this "collection" helps somebody else). So, how can I reliably detect the wrongly encoded characters, or where is my mistake?

Can anybody give me a hint?

Greetz

V

  • When using mb_detect_encoding(): (1) feed it the possible character sets, because without them the function is next to useless; (2) require strict detection. In other words, use the 2nd and 3rd arguments of the function. Pick apart the CSV first: I get that the same line can hold different character sets, but I doubt it changes within one field, so use fgetcsv() and 'fix' the entries individually. Commented Oct 12, 2013 at 12:52
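
The advice in this comment could be sketched roughly as follows. This is only a minimal sketch under assumptions of mine: `fieldToUtf8()` and `readCsvAsUtf8()` are hypothetical names, the `;` delimiter is a guess, and it assumes that any field which is not valid UTF-8 is Windows-1252 or ISO-8859-1. It needs the mbstring extension and PHP 8.

```php
<?php
// Hypothetical helper (not from the original post): convert one CSV field to UTF-8.
// Assumption: a field that is not valid UTF-8 is in one of the $fallbacks encodings.
function fieldToUtf8(string $field, array $fallbacks = ['Windows-1252', 'ISO-8859-1']): string
{
    // Valid UTF-8 (which includes plain ASCII) is left untouched.
    if (mb_check_encoding($field, 'UTF-8')) {
        return $field;
    }
    // 2nd argument: candidate encodings; 3rd argument: strict detection.
    $encoding = mb_detect_encoding($field, $fallbacks, true);
    if ($encoding === false) {
        return $field; // nothing matched: leave the bytes alone
    }
    return mb_convert_encoding($field, 'UTF-8', $encoding);
}

// Hypothetical helper: read a CSV file and fix every field individually.
function readCsvAsUtf8(string $path, string $delimiter = ';'): array
{
    $handle = fopen($path, 'rb');
    if ($handle === false) {
        throw new RuntimeException("Cannot open $path");
    }
    $rows = [];
    while (($row = fgetcsv($handle, null, $delimiter, '"', '')) !== false) {
        if ($row === [null]) {
            continue; // fgetcsv() reports a blank line as [null]
        }
        $rows[] = array_map('fieldToUtf8', $row);
    }
    fclose($handle);
    return $rows;
}
```

Note that a double-encoded value is itself valid UTF-8, so this sketch leaves such a value unchanged; it only covers fields that arrive in a single-byte encoding.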

1 Answer


There is a nice PHP class that can help you with that: https://github.com/neitanod/forceutf8. It converts strings encoded in Latin-1 (ISO 8859-1), Windows-1252, UTF-8, or a mix of them to UTF-8, and handles the detection for you. Hope it helps.
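
For reference, the library's README documents `Encoding::toUTF8()` and `Encoding::fixUTF8()` for this. The core idea behind repairing a double-encoded value (UTF-8 text that was read as Windows-1252 and then saved as UTF-8 again) can also be sketched with mbstring alone. This is a rough illustration under my own assumptions, not the library's implementation: `undoDoubleEncoding()` is a hypothetical name, and it only undoes a single round of that specific misreading.

```php
<?php
// Hypothetical helper (not part of ForceUTF8): undo one round of
// "UTF-8 bytes misread as Windows-1252 and re-encoded as UTF-8".
function undoDoubleEncoding(string $value): string
{
    // Only valid UTF-8 can be the result of such a re-encoding.
    if (!mb_check_encoding($value, 'UTF-8')) {
        return $value;
    }
    // Map every character back to the single byte it came from.
    $bytes = mb_convert_encoding($value, 'Windows-1252', 'UTF-8');
    // Accept the result only if nothing was lost on the way back
    // and the recovered bytes are themselves valid UTF-8.
    $lossless = mb_convert_encoding($bytes, 'UTF-8', 'Windows-1252') === $value;
    if ($lossless && mb_check_encoding($bytes, 'UTF-8')) {
        return $bytes;
    }
    return $value;
}

// "M\xC3\x83\xC2\xB6belmarkt" is the double-encoded form of "Möbelmarkt".
$fixed = undoDoubleEncoding("M\xC3\x83\xC2\xB6belmarkt");
// A correctly encoded "Ü5" is returned as it is.
$kept = undoDoubleEncoding("\xC3\x9C5");
```

The check is a heuristic: text that legitimately contains a sequence such as "Ã" followed by "¶" would be "repaired" by mistake, and bytes that Windows-1252 leaves undefined may not survive the round trip, so results should be spot-checked on real data.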


1 Comment

Thanks Davy Baert and also Wrikken. I had never seen an example of mb_detect_encoding() where an encoding list is given, and I can imagine that it would also work with mb_detect_encoding(). I personally tried the Encoding class that Davy Baert suggested and it works very well. It is easy and exactly what I was looking for. Thanks for the help!
