Convert character encoding to UTF-8 in a .csv file

Question

When I export my LinkedIn connections from:
https://www.linkedin.com/connected/manage_sources
I get a Microsoft Outlook CSV file back.

But when I try to CSV.read the file using Ruby I get the following error:

invalid byte sequence in UTF-8
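
The error can be reproduced without the export file. Below is a minimal sketch, assuming a made-up row that contains the single byte 0xE9 (the letter é in Latin-1 and Windows-1252). The helper name `utf8_clean?` and the sample data are illustrative, not taken from the LinkedIn export.

```ruby
# Hypothetical sample: "café" as a legacy single-byte encoding would store it.
# A lone 0xE9 byte is not a complete UTF-8 sequence.
SAMPLE_ROW = "name,caf\xE9\r\n".b

# Returns true when the given bytes form valid UTF-8.
def utf8_clean?(bytes)
  bytes.dup.force_encoding(Encoding::UTF_8).valid_encoding?
end

puts utf8_clean?("name,cafe\r\n".b) # pure ASCII is always valid UTF-8
puts utf8_clean?(SAMPLE_ROW)        # the 0xE9 byte makes the row invalid
```

Ruby labels file contents as UTF-8 by default, so any byte like this one makes string operations fail until the data is transcoded from its real encoding.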

I can fix the encoding by opening the CSV in Excel and saving it again with Save As, choosing UTF-8 encoding.

However, I'd really like to be able to do this from the command line and not have to use Excel at all.

I read in another answer that iconv might be an option. But I wasn't able to get it to work:

iconv -f US-ASCII -t UTF-8 test/fixtures/1481995385116.csv

error:

iconv: test/fixtures/1481995385116.csv:145:19: cannot convert

When I check what kind of file it is I get:

test/fixtures/1481995385116.csv: Non-ISO extended-ASCII text, with very long lines, with CRLF, LF line terminators

Is there a different CLI I can use or am I using iconv wrong?

Edit:

As suggested, the output of hexdump:

➜  c/t/fixtures master ✗ hexdump 1482372034326.csv|head
0000000 22 54 69 74 6c 65 22 2c 22 46 69 72 73 74 20 4e
0000010 61 6d 65 22 2c 22 4d 69 64 64 6c 65 20 4e 61 6d
0000020 65 22 2c 22 4c 61 73 74 20 4e 61 6d 65 22 2c 22
0000030 53 75 66 66 69 78 22 2c 22 45 2d 6d 61 69 6c 20
0000040 41 64 64 72 65 73 73 22 2c 22 45 2d 6d 61 69 6c
0000050 20 32 20 41 64 64 72 65 73 73 22 2c 22 45 2d 6d
0000060 61 69 6c 20 33 20 41 64 64 72 65 73 73 22 2c 22
0000070 42 75 73 69 6e 65 73 73 20 53 74 72 65 65 74 22
0000080 2c 22 42 75 73 69 6e 65 73 73 20 53 74 72 65 65
0000090 74 20 32 22 2c 22 42 75 73 69 6e 65 73 73 20 53
➜  c/t/fixtures master ✗ file 1482002728101.csv
1482002728101.csv: UTF-8 Unicode text, with very long lines, with CR line terminators
➜  c/t/fixtures master ✗ file 1482372034326.csv
1482372034326.csv: Non-ISO extended-ASCII text, with very long lines, with CRLF, LF line terminators
➜  c/t/fixtures master ✗ hexdump -c 1482002728101.csv|head
0000000   T   i   t   l   e   ,   F   i   r   s   t       N   a   m   e
0000010   ,   M   i   d   d   l   e       N   a   m   e   ,   L   a   s
0000020   t       N   a   m   e   ,   S   u   f   f   i   x   ,   E   -
0000030   m   a   i   l       A   d   d   r   e   s   s   ,   E   -   m
0000040   a   i   l       2       A   d   d   r   e   s   s   ,   E   -
0000050   m   a   i   l       3       A   d   d   r   e   s   s   ,   B
0000060   u   s   i   n   e   s   s       S   t   r   e   e   t   ,   B
0000070   u   s   i   n   e   s   s       S   t   r   e   e   t       2
0000080   ,   B   u   s   i   n   e   s   s       S   t   r   e   e   t
0000090       3   ,   B   u   s   i   n   e   s   s       C   i   t   y
➜  c/t/fixtures master ✗ hexdump -c 1482372034326.csv|head
0000000   "   T   i   t   l   e   "   ,   "   F   i   r   s   t       N
0000010   a   m   e   "   ,   "   M   i   d   d   l   e       N   a   m
0000020   e   "   ,   "   L   a   s   t       N   a   m   e   "   ,   "
0000030   S   u   f   f   i   x   "   ,   "   E   -   m   a   i   l
0000040   A   d   d   r   e   s   s   "   ,   "   E   -   m   a   i   l
0000050       2       A   d   d   r   e   s   s   "   ,   "   E   -   m
0000060   a   i   l       3       A   d   d   r   e   s   s   "   ,   "
0000070   B   u   s   i   n   e   s   s       S   t   r   e   e   t   "
0000080   ,   "   B   u   s   i   n   e   s   s       S   t   r   e   e
0000090   t       2   "   ,   "   B   u   s   i   n   e   s   s       S

How do you tell the format from the output?

Conversion from US-ASCII to UTF-8 is a no-op: US-ASCII is a proper subset of UTF-8. You must first determine which character encoding your file uses; maybe windows-1252?
– AlexP, Commented Dec 17, 2016 at 19:42
I've tried using enca but it didn't work. What tool do you use to figure out what the encoding is?
– mbigras, Commented Dec 17, 2016 at 19:49
hexdump of course; post the first few lines of output. But first check whether it's Windows-1252, which in Windows is called ANSI: iconv -f windows-1252 -t utf-8
– AlexP, Commented Dec 17, 2016 at 20:00
iconv -f windows-1252 -t utf-8 will probably work in the sense of producing some result, but you need to check whether it's the right result. Check whether non-ASCII characters are correct in the output.
– Gilles 'SO- stop being evil', Commented Dec 18, 2016 at 23:11
Have you tried iconv -f windows-1252 -t utf-8 as a first attempt? Windows files usually are in that encoding. With hexdump -C you should look for characters between 0x80 and 0xFF.
– AlexP, Commented Dec 22, 2016 at 2:39
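
The advice to look for bytes between 0x80 and 0xFF can also be scripted rather than read off a hexdump. This is a rough sketch in Ruby; the method name `high_bytes` and the sample input are assumptions made for illustration, and the way lines and columns are counted here may not match iconv's error positions exactly.

```ruby
# Returns [line, column, byte] triples for every byte in 0x80..0xFF.
# Lines and columns are 1-based; lines are split on LF only.
def high_bytes(data)
  hits = []
  data.b.each_line("\n").with_index(1) do |line, line_no|
    line.each_byte.with_index(1) do |byte, col|
      hits << [line_no, col, byte] if byte >= 0x80
    end
  end
  hits
end

# Hypothetical input: a header row, then a row containing the byte 0x8D.
sample = "\"Title\",\"First Name\"\r\n\"\",\"??\x8Den\"\r\n".b
high_bytes(sample).each do |line_no, col, byte|
  puts format("line %d, column %d: 0x%02X", line_no, col, byte)
end
```

For a real file, passing `File.binread(path)` would list every non-ASCII byte, which are the only bytes whose meaning depends on the source encoding.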

mbigras · Accepted Answer · 2016-12-22 21:22:36Z

$ iconv -f windows-1252 -t utf-8 linkedin_contacts.csv
.
.
.
"","Ahmet XXXXX","","??
iconv: linkedin_contacts.csv:665:23: cannot convert
$ cat linkedin_contacts.csv|grep Ahmet|hexdump -C| sed -n '1,2p'
00000000  22 22 2c 22 41 68 6d 65  74 20 53 61 6c 69 68 22  |"","Ahmet XXXXX"|
00000010  2c 22 22 2c 22 3f 3f 8d  65 6e 22 2c 22 22 2c 22  |,"","??.en","","|

I looked up the byte 8d in an extended ASCII table, and it appears to be defined in ISO 8859-1. Windows-1252 leaves that byte unassigned, which is why the first attempt failed. Checking iconv --list | grep 8859-1 confirms that iconv supports that encoding.
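
The difference between the two encodings for this byte can be checked in Ruby. This is a sketch under the assumption that Ruby's transcoding tables agree with iconv's for the bytes shown; the helper name `code_point_for` is made up for illustration.

```ruby
# Interprets one raw byte under the given source encoding and transcodes it
# to UTF-8. Returns the resulting code point, or nil when the source encoding
# does not define that byte.
def code_point_for(byte, source_encoding)
  byte.chr.force_encoding(source_encoding).encode(Encoding::UTF_8).ord
rescue Encoding::UndefinedConversionError
  nil
end

[0x8D, 0x80, 0xE9].each do |byte|
  %w[Windows-1252 ISO-8859-1].each do |enc|
    cp = code_point_for(byte, enc)
    label = cp ? format("U+%04X", cp) : "undefined"
    puts format("0x%02X as %-12s -> %s", byte, enc, label)
  end
end
```

ISO-8859-1 assigns a code point to every byte, so a conversion from it never fails. For 0x8D the result is U+008D, a C1 control character rather than a letter, so it is worth checking that the converted name looks right.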

$ iconv -f ISO-8859-1 -t UTF-8 linkedin_contacts.csv > foo.rb
$ file foo.rb
foo.rb: UTF-8 Unicode text, with very long lines, with CRLF, LF line terminators
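
The same conversion can be done inside Ruby instead of with iconv. This is a minimal sketch, assuming the file really is ISO-8859-1; the method name `read_as_utf8` and the demo contents are invented for illustration.

```ruby
require "tempfile"

# Reads raw bytes, labels them with the assumed source encoding,
# and returns the text transcoded to UTF-8.
def read_as_utf8(path, source_encoding = "ISO-8859-1")
  File.binread(path).force_encoding(source_encoding).encode(Encoding::UTF_8)
end

# Demo on a throwaway file whose contents are made up for this example.
Tempfile.create(["contacts", ".csv"]) do |file|
  file.binmode
  file.write("\"First Name\"\r\n\"Ren\xE9\"\r\n".b)
  file.flush
  text = read_as_utf8(file.path)
  puts text.encoding        # the result is tagged as UTF-8
  puts text.valid_encoding? # and contains only valid UTF-8 sequences
end
```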

Having both kinds of line terminator is still a problem for Ruby, but deleting the last line of the file leaves only CRLF terminators, and then it's all good :)

$ sed '$ d' foo.rb > bar.csv
$ file bar.csv
bar.csv: UTF-8 Unicode text, with very long lines, with CRLF line terminators
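
An alternative to deleting the last line is to normalize the terminators before parsing, which keeps every row. This is a sketch of that different approach, not what the commands above do; the method name `normalize_newlines` and the sample rows are assumptions.

```ruby
require "csv"

# Rewrites CRLF and bare CR as LF so the text has a single row terminator.
def normalize_newlines(text)
  text.gsub(/\r\n?/, "\n")
end

# Hypothetical export with CRLF rows and a bare LF on the last line.
mixed = "\"First Name\",\"Last Name\"\r\n" \
        "\"Ada\",\"Lovelace\"\r\n" \
        "\"Alan\",\"Turing\"\n"

rows = CSV.parse(normalize_newlines(mixed), headers: true)
rows.each { |row| puts row["Last Name"] }
```

Note that this also rewrites line breaks inside quoted fields, which may or may not matter for a contacts export.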
