PHP : Unknown encoding in CSV file

Question

I'm kinda new to encoding issues.

I have a CSV file I get from a client and can't figure out how it is encoded.

The "é" accents appear as � in vim or OpenOffice. When I try to convert them to UTF-8 using mb_convert_encoding($string, "UTF-8") or utf8_encode($string), I get "ï¿½".

I tried converting from some Latin encodings (ISO-8859-1, ISO-8859-15) to UTF-8 with iconv and mb_convert_encoding.

I also tried a method I found to convert from CP1250 to UTF-8, and another one from Macintosh to UTF-8.

Still no luck. Is there any way to find a solution without asking the client to change his CSV encoding to UTF-8?

Thanks a lot !

EDIT: To find the correct encoding, I looped over all the encodings listed by mb_list_encodings() and tried to convert from each of them to UTF-8. None of them could render an "é". I'll just ask the client to use UTF-8 when he exports his CSV.

Using vim to get the hexadecimal value of the wrong character, I can say the � character (U+FFFD) is actually in the file, so the encoding damage happened on the client's side.

Can you use an editor that displays the hexadecimal values of each character? Then post back the result for é and other problematic values; that will help us guess the encoding.
– Tim Pietzcker, Commented Jun 26, 2014 at 9:05
Tell us what language (English, French, Chinese, whatever) the CSV file data is supposed to be in. Only then can we find the right encoding scheme for your data.
– Ayub, Commented Jun 26, 2014 at 9:09
Make a copy of the original file(s) first if you are committed to testing all the encoding schemes available in your editor; otherwise you risk irreversible loss of data.
– Ayub, Commented Jun 26, 2014 at 9:12
@TimPietzcker: vim's "ga" command returns <�> 65533, Hex fffd, Octal 177775
– user3316439, Commented Jun 26, 2014 at 9:46
@TimPietzcker: I guess it means the file itself contains the � character and the encoding issue is client-side.
– user3316439, Commented Jun 26, 2014 at 9:53
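
The check described in these comments can also be done from PHP. This is only a rough sketch: countReplacementChars() is a made-up helper name, and the sample row stands in for the real file contents. It assumes the mbstring extension is loaded. It counts U+FFFD characters in the raw bytes, and it also shows why treating those bytes as ISO-8859-1 produces the "ï¿½" from the question.

```php
<?php
// U+FFFD (REPLACEMENT CHARACTER) is the three bytes EF BF BD in UTF-8.
const REPLACEMENT_CHAR = "\xEF\xBF\xBD";

// Hypothetical helper: counts replacement characters in raw bytes.
function countReplacementChars(string $bytes): int
{
    return substr_count($bytes, REPLACEMENT_CHAR);
}

// Invented sample row; in practice this would come from file_get_contents().
$raw = "nom;pr" . REPLACEMENT_CHAR . "nom\n";

echo countReplacementChars($raw), PHP_EOL;  // how many U+FFFD are in the data
echo bin2hex(REPLACEMENT_CHAR), PHP_EOL;    // efbfbd

// Reading those three bytes as ISO-8859-1 and converting to UTF-8
// turns each byte into its own character: ï, ¿ and ½.
$mojibake = mb_convert_encoding(REPLACEMENT_CHAR, 'UTF-8', 'ISO-8859-1');
echo $mojibake, PHP_EOL;
```

If the count is above zero, the original bytes were already replaced before the file reached you, and no conversion can bring the "é" back.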

deceze · Accepted Answer · 2014-06-26 09:05:33Z

2

You need to know what encoding a file is in, period. If you don't know that, try viewing the document in a number of different encodings (e.g. some text editors offer File → Reopen using Encoding... or a similar action) until you find the one in which the file makes sense.

That, or convert the file from different encodings to your preferred encoding. mb_convert_encoding($string, "UTF-8") alone won't help; it can't magically guess what to convert from. Try:

echo mb_convert_encoding($string, 'UTF-8', 'ISO-8859-1');
echo mb_convert_encoding($string, 'UTF-8', 'SJIS');
...

Until you've found the encoding where the document looks correct.
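
The same guessing can be put in a loop. This is a sketch, not a definitive tool: previewEncodings() is a hypothetical helper, the candidate list is just an example, and the sample bytes are invented. It assumes the mbstring extension is loaded and that the encoding names are ones it supports.

```php
<?php
// Hypothetical helper: converts $bytes from each candidate encoding to UTF-8
// and returns the results keyed by the source encoding name.
function previewEncodings(string $bytes, array $candidates): array
{
    $previews = [];
    foreach ($candidates as $from) {
        $previews[$from] = mb_convert_encoding($bytes, 'UTF-8', $from);
    }
    return $previews;
}

// Invented sample: byte E9 is "é" in ISO-8859-1, ISO-8859-15 and Windows-1252.
$sample = "caf\xE9";
$candidates = ['ISO-8859-1', 'ISO-8859-15', 'Windows-1252'];

// Print one line per candidate so the results can be compared by eye.
foreach (previewEncodings($sample, $candidates) as $from => $text) {
    echo str_pad($from, 14), $text, PHP_EOL;
}
```

A human still has to look at the output and pick the line that reads correctly; the loop only saves typing.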

If all that guessing doesn't help, ask the originator of the document to pay attention to what encoding they're using, or tell them explicitly what to do to provide you the document in the encoding you need.

Read "What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text".

answered Jun 26, 2014 at 9:05


3 Comments

Ayub Over a year ago

Classic article everyone should read about encoding! +1 for the link

deceze Over a year ago

So you can update while you update? ;)

user3316439 Over a year ago

Huh, I wrote that too fast :s I just wanted to add that mb_convert_encoding($string, "UTF-8") converts from the internal encoding (ISO-8859-1 in my case) to UTF-8. It is thus equivalent to utf8_encode(), which solves most of my encoding issues.
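
To make the point about the internal encoding concrete, here is a small sketch. It assumes the mbstring extension is loaded, and the one-byte sample is invented. Naming the source encoding explicitly gives the same result on every setup, while the two-argument form depends on whatever mb_internal_encoding() reports on that machine.

```php
<?php
// Invented sample: "é" as a single ISO-8859-1 byte.
$latin1 = "\xE9";

// Explicit source encoding: the result does not depend on configuration.
$explicit = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');
echo bin2hex($explicit), PHP_EOL;  // c3a9, the UTF-8 bytes for "é"

// Without a third argument the source is the internal encoding,
// so it is worth checking what that is on your setup.
echo mb_internal_encoding(), PHP_EOL;
```

If the second line does not print ISO-8859-1, the two-argument call will not behave like utf8_encode() there.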
