
I'm fairly new to encoding issues.

I have a CSV file that I get from a client, and I can't figure out how it is encoded.

The "é" characters show up as � in vim and OpenOffice. When I try to convert them to UTF-8 with mb_convert_encoding($string, "UTF-8") or utf8_encode($string), I still get "�".

I tried converting from some Latin encodings (ISO-8859-1, ISO-8859-15) to UTF-8 with iconv and mb_convert_encoding.

I also tried a method I found to convert from CP1250 to UTF-8, and another one from Macintosh to UTF-8.

Still no luck. Is there any way to solve this without asking the client to change his CSV encoding to UTF-8?

Thanks a lot!

EDIT: To find the correct encoding, I looped over all the encodings listed by mb_list_encodings() and tried converting from each of them to UTF-8. None of them produced an "é". I'll just ask the client to use UTF-8 when he exports his CSV.

Using vim to get the hexadecimal value of the wrong character, I can see that the � character (U+FFFD) is actually stored in the file, so the encoding problem is on the client's side.
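
The same check can be done from PHP instead of vim. The following is only a sketch: countReplacementChars() is a hypothetical helper, and the sample strings stand in for lines of the client's file. It relies on the fact that U+FFFD, once saved as UTF-8, is the three bytes EF BF BD.

```php
<?php
// Hypothetical helper: counts literal U+FFFD characters stored as UTF-8 bytes.
function countReplacementChars(string $bytes): int
{
    // U+FFFD encoded as UTF-8 is the byte sequence EF BF BD.
    return substr_count($bytes, "\xEF\xBF\xBD");
}

// Sample data standing in for lines of the client's file (an assumption).
$damaged = "caf\xEF\xBF\xBD"; // "caf" followed by a literal U+FFFD
$latin1  = "caf\xE9";         // "café" as ISO-8859-1 bytes, still recoverable

echo bin2hex($damaged), "\n";               // 636166efbfbd
echo countReplacementChars($damaged), "\n"; // 1
echo countReplacementChars($latin1), "\n";  // 0
```

If the count is non-zero, the original byte for "é" was already thrown away before the file reached you, so no conversion on your side can bring it back.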

  • Can you use an editor that displays the hexadecimal value of each character? Then post back the result for é and other problematic values; that will help us guess the encoding. Commented Jun 26, 2014 at 9:05
  • Tell us what language (English, French, Chinese ... whatever) the CSV file data is supposed to be in. Only then can we find the right encoding scheme for your data. Commented Jun 26, 2014 at 9:09
  • Make a copy of the original file(s) first if you are committed to testing all the encoding schemes available in your editor, because re-saving in the wrong encoding can cause irreversible loss of data. Commented Jun 26, 2014 at 9:12
  • @TimPietzcker : vim "ga" command returns <�> 65533, Hexa fffd, Octal 177775 Commented Jun 26, 2014 at 9:46
  • @TimPietzcker : I guess it means the file itself contains the � character and the encoding issue is client-side. Commented Jun 26, 2014 at 9:53

1 Answer


You need to know what encoding a file is in, period. If you don't know that, try to view the document as a bunch of different encodings (e.g. in some text editors you have the option of File → Reopen using Encoding... or similar such actions), until you find the encoding that the file makes sense in.

That, or convert the file from different encodings to your preferred encoding. Calling mb_convert_encoding($string, "UTF-8") on its own won't help: without a source encoding it falls back to the internal encoding, and it can't magically guess what to convert from. Try:

echo mb_convert_encoding($string, 'UTF-8', 'ISO-8859-1');
echo mb_convert_encoding($string, 'UTF-8', 'SJIS');
...

Until you've found the encoding where the document looks correct.
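
That trial-and-error step can be sketched as a small loop. This assumes the mbstring extension is loaded; tryEncodings(), the candidate list, and the sample bytes are made up for illustration and are not part of any library.

```php
<?php
// Hypothetical helper: converts $bytes to UTF-8 from each candidate encoding
// and returns the results keyed by the encoding name that was tried.
function tryEncodings(string $bytes, array $candidates): array
{
    $results = [];
    foreach ($candidates as $from) {
        // The source encoding is always passed explicitly; nothing is guessed.
        $results[$from] = mb_convert_encoding($bytes, 'UTF-8', $from);
    }
    return $results;
}

// Sample input: "café" as ISO-8859-1 bytes (an assumption for illustration).
$sample = "caf\xE9";

// Print each attempt so the results can be compared by eye.
$candidates = ['ISO-8859-1', 'ISO-8859-15', 'Windows-1252', 'ISO-8859-5'];
foreach (tryEncodings($sample, $candidates) as $from => $utf8) {
    echo str_pad($from, 14), $utf8, "\n";
}
```

Run it on a short line from the real file that is known to contain an accented character, and keep the candidate whose output reads correctly.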

If all that guessing doesn't help, ask the originator of the document to pay attention to what encoding they're using, or tell them explicitly what to do to provide you the document in the encoding you need.

Read What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text.


3 Comments

Classic article everyone should read about encoding! +1 for the link
Huh, I wrote that too fast :s I just wanted to add that mb_convert_encoding($string, "UTF-8") converts from the internal encoding (ISO-8859-1 in my case) to UTF-8. It is thus equivalent to utf8_encode(), which solves most of my encoding issues.
