How can I correctly convert Unicode Text format data to CSV in Python?

Question

I am accessing a dataset that lives on an FTP server. After downloading the data, I used pandas to read it as CSV, but I got an encoding error. The file has a .csv extension, but when I opened it in MS Excel, the data was in Unicode Text format. I want to convert datasets stored in Unicode Text format to CSV. How can I do this?

My attempt:

from ftplib import FTP
import os

def mydef():
    defaultIP = ''
    username = 'cat'
    password = 'cat'

    ftp = FTP(defaultIP, user=username, passwd=password)
    ftp.dir()

    filenames = ftp.nlst()

    # download every listed file in binary mode so the bytes are not altered
    for filename in filenames:
        local_filename = os.path.join('C:\\Users\\me', filename)
        with open(local_filename, 'wb') as file:
            ftp.retrbinary('RETR ' + filename, file.write)

    ftp.quit()

Then I tried this to get the correct encoding:

mydef.encode('utf-8').splitlines()

but this does not work for me. I used this solution.

Here is an output snippet of the code above:

b'\xff\xfeF\x00L\x00O\x00W\x00\t\x00C\x00T\x00Y\x00_\x00R\x00P\x00T\x00\t\x00R\x00E\x00P\x00O\x00R\x00T\x00E\x00R\x00\t\x00C\x00T\x00Y\x00_\x00P\x00T\x00N\x00\t\x00P\x00A\x00R\x00T\x00N\x00E\x00R\x00\t\x00C\x00O\x00M\x00M\x00O\x00D\x00I\x00T\x00Y\x00\t\x00D\x00E\x00S\x00C\x00R\x00I\x00P\x00T\x00I\x00O\x00N\x00\t'

Expected output

The expected output should be ordinary CSV data, such as common trade data, but the encoding doesn't work for me.

I tried several different encodings to convert the data to CSV, but none of them worked. How can I make this work? Thanks.

If it is a CSV file, open it in a plain text editor to see what you have. It doesn't look like a CSV file. Or maybe it doesn't use utf-8 but another encoding, e.g. utf-16. utf-16 is sometimes used on Windows.
– furas, Commented Jan 14, 2020 at 21:18

furas · Accepted Answer · 2020-01-14 21:35:06Z

2

EDIT: I had to change this: now I remove the 2 bytes at the beginning (the BOM) and one byte at the end, because the data is incomplete (every char here takes 2 bytes).

It seems it is not utf-8 but utf-16 with a BOM.

If I remove the first two bytes (the BOM, Byte Order Mark) and the last byte, which is incomplete (every char here takes two bytes), and use decode('utf-16-le')

b'F\x00L\x00O\x00W\x00\t\x00C\x00T\x00Y\x00_\x00R\x00P\x00T\x00\t\x00R\x00E\x00P\x00O\x00R\x00T\x00E\x00R\x00\t\x00C\x00T\x00Y\x00_\x00P\x00T\x00N\x00\t\x00P\x00A\x00R\x00T\x00N\x00E\x00R\x00\t\x00C\x00O\x00M\x00M\x00O\x00D\x00I\x00T\x00Y\x00\t\x00D\x00E\x00S\x00C\x00R\x00I\x00P\x00T\x00I\x00O\x00N\x00'.decode('utf-16-le')

then I get

'FLOW\tCTY_RPT\tREPORTER\tCTY_PTN\tPARTNER\tCOMMODITY\tDESCRIPTION'
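As a small sketch of an alternative to stripping the BOM by hand: Python's plain 'utf-16' codec reads the BOM itself, picks the byte order from it, and leaves it out of the result. The `raw` value below is a shortened, even-length sample in the same layout as the question's snippet, made up for illustration.

```python
# Shortened sample (an assumption for illustration): UTF-16 little-endian
# bytes with a leading BOM (\xff\xfe), like the question's output.
raw = b'\xff\xfeF\x00L\x00O\x00W\x00\t\x00C\x00T\x00Y\x00'

# 'utf-16' (no -le/-be suffix) consumes the BOM and uses it to choose
# the byte order, so nothing has to be sliced off manually.
text = raw.decode('utf-16')

# The data is tab-separated, so the header fields split on '\t'.
fields = text.split('\t')
print(fields)
```

This only works when the byte string has an even length; a snippet cut off in the middle of a 2-byte unit will still fail to decode.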

EDIT: Meanwhile I also found: Python - Decode UTF-16 file with BOM
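For a complete downloaded file, a conversion along these lines might be enough. This is a minimal sketch using only the standard library; the function name `utf16_tsv_to_csv` and the paths in the usage comment are hypothetical, and it assumes the file is tab-separated UTF-16 with a BOM, as the snippet in the question suggests.

```python
import csv

def utf16_tsv_to_csv(src_path, dst_path):
    """Read a tab-separated UTF-16 file and write it as comma-separated UTF-8."""
    # encoding='utf-16' lets Python read the BOM and pick the byte order;
    # newline='' is what the csv module expects for both reading and writing.
    with open(src_path, encoding='utf-16', newline='') as src, \
         open(dst_path, 'w', encoding='utf-8', newline='') as dst:
        reader = csv.reader(src, delimiter='\t')
        writer = csv.writer(dst)
        writer.writerows(reader)

# Hypothetical usage (paths are placeholders):
# utf16_tsv_to_csv('C:\\Users\\me\\data.csv', 'C:\\Users\\me\\data_utf8.csv')
```

If pandas is preferred, `pd.read_csv(path, encoding='utf-16', sep='\t')` should read the same kind of file directly, again assuming the download is complete.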

edited Jan 14, 2020 at 21:35

answered Jan 14, 2020 at 21:24

furas


5 Comments

Mark Ransom Over a year ago

The BOM in UTF-16 is 2 bytes, not 3.

furas Over a year ago

I couldn't decode it, so I removed the third byte. But I found the problem: the text is incomplete, so I have to remove the last byte; then I can remove the 2-byte BOM at the beginning and decode.

furas Over a year ago

@MarkRansom I changed it: I had to remove the last byte instead of the third byte at the beginning.

Jerry07 Over a year ago

@furas is there a more consistent solution than removing the third byte manually? The actual output is much longer than this snippet.

furas Over a year ago

Read my current answer: now I remove only two bytes at the beginning, because the BOM has 2 bytes. If you use the full data, you don't have to remove the third byte; I had to remove a byte only because you gave incomplete data and it didn't decode with the last byte. In UTF-16 every char here uses 2 bytes, and the data needs an even number of bytes to decode.
