
I am accessing a dataset that lives on an FTP server. After downloading the data, I used pandas to read it as CSV, but I got an encoding error. The file has a .csv extension, but when I opened it in MS Excel, the data was in Unicode Text format. I want to convert these datasets, which are stored in Unicode Text format, into something I can read as normal CSV. How can I do this?

my attempt:

from ftplib import FTP
import os

def mydef():
    defaultIP = ''
    username = 'cat'
    password = 'cat'

    ftp = FTP(defaultIP, user=username, passwd=password)
    ftp.dir()

    # list of file names in the current FTP directory
    filenames = ftp.nlst()

    for filename in filenames:
        local_filename = os.path.join('C:\\Users\\me', filename)
        # write in binary mode so the downloaded bytes are stored unchanged
        with open(local_filename, 'wb') as file:
            ftp.retrbinary('RETR ' + filename, file.write)

    ftp.quit()

Then I tried this to get the correct encoding:

mydef.encode('utf-8').splitlines()

But this does not work for me. I used this solution.

Here is a snippet of the output of the above code:

b'\xff\xfeF\x00L\x00O\x00W\x00\t\x00C\x00T\x00Y\x00_\x00R\x00P\x00T\x00\t\x00R\x00E\x00P\x00O\x00R\x00T\x00E\x00R\x00\t\x00C\x00T\x00Y\x00_\x00P\x00T\x00N\x00\t\x00P\x00A\x00R\x00T\x00N\x00E\x00R\x00\t\x00C\x00O\x00M\x00M\x00O\x00D\x00I\x00T\x00Y\x00\t\x00D\x00E\x00S\x00C\x00R\x00I\x00P\x00T\x00I\x00O\x00N\x00\t'

Expected output:

The expected output is normal CSV data, like common trade data, but none of the encodings I tried produced it.

How can I convert this file into correctly decoded CSV data? Thanks.

  • If it is a CSV file, open it in a normal text editor to see what you have. It doesn't look like a CSV file. Or maybe it doesn't use utf-8 but another encoding, e.g. utf-16. utf-16 is sometimes used on Windows. Commented Jan 14, 2020 at 21:18
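One way to check the encoding without a text editor is to look at the first few bytes of the file for a byte order mark (BOM). The following is a minimal sketch; the helper name `guess_encoding_from_bom` is made up for illustration, and a file without a BOM simply yields `None`, so this is a hint rather than a reliable detector.

```python
import codecs
from typing import Optional

# Longer BOMs are checked first: the UTF-32-LE BOM starts with the UTF-16-LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]

def guess_encoding_from_bom(head: bytes) -> Optional[str]:
    """Return a Python codec name if `head` starts with a known BOM, else None."""
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name
    return None

# First bytes of the output shown in the question
sample = b'\xff\xfeF\x00L\x00O\x00W\x00'
print(guess_encoding_from_bom(sample))  # utf-16
```

For a file on disk, the same check could be run on `open(path, 'rb').read(4)`.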

1 Answer


EDIT: I had to change this answer. I now remove the 2 bytes at the beginning (the BOM) and one byte at the end, because the posted data is incomplete (UTF-16 uses 2-byte code units, so the data must have an even number of bytes).

It seems the data is not utf-8 but utf-16 with a BOM.

If I remove the first two bytes (the BOM, Byte Order Mark) and the last byte, which is half of an incomplete character, and then use decode('utf-16-le'):

b'F\x00L\x00O\x00W\x00\t\x00C\x00T\x00Y\x00_\x00R\x00P\x00T\x00\t\x00R\x00E\x00P\x00O\x00R\x00T\x00E\x00R\x00\t\x00C\x00T\x00Y\x00_\x00P\x00T\x00N\x00\t\x00P\x00A\x00R\x00T\x00N\x00E\x00R\x00\t\x00C\x00O\x00M\x00M\x00O\x00D\x00I\x00T\x00Y\x00\t\x00D\x00E\x00S\x00C\x00R\x00I\x00P\x00T\x00I\x00O\x00N\x00'.decode('utf-16-le')

then I get

'FLOW\tCTY_RPT\tREPORTER\tCTY_PTN\tPARTNER\tCOMMODITY\tDESCRIPTION'
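With complete data, the BOM does not need to be stripped by hand: Python's `utf-16` codec reads the BOM, picks the byte order, and drops it from the result. The sketch below builds a small in-memory sample (the header names come from the question, the data row is invented for illustration) and parses it as tab-separated text with the standard library.

```python
import csv
import io

# Simulated file content: UTF-16 BOM (little-endian) followed by tab-separated text.
# The second row is made-up sample data, not taken from the real file.
raw = b'\xff\xfe' + 'FLOW\tCTY_RPT\tREPORTER\r\n1\t001\tExample\r\n'.encode('utf-16-le')

# 'utf-16' consumes the BOM and uses it to choose the byte order
text = raw.decode('utf-16')

# The file is tab-separated despite its .csv extension
rows = list(csv.reader(io.StringIO(text, newline=''), delimiter='\t'))
print(rows[0])  # ['FLOW', 'CTY_RPT', 'REPORTER']
```

Assuming the downloaded file is complete, the same idea should carry over to pandas with `pd.read_csv(local_filename, sep='\t', encoding='utf-16')`; this has not been tested against the actual FTP data.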

EDIT: meanwhile, I also found the question "Python - Decode UTF-16 file with BOM".


5 Comments

BOM in UTF-16 is 2 bytes, not 3.
I couldn't decode it, so I removed the third byte. But then I found the real problem: the text is incomplete, so I have to remove the last byte, and then I can remove the 2-byte BOM at the beginning and decode.
@MarkRansom I changed it. I had to remove the last byte instead of the third byte at the beginning.
@furas Is there a more consistent solution than removing a byte manually? The actual output is much longer than this snippet.
Read my current answer. Now I remove only two bytes at the beginning, because the BOM has 2 bytes. If you use the full data, you don't have to remove any other byte. I had to remove the last byte only because you posted incomplete data and it didn't decode with that byte. UTF-16 uses 2-byte code units, so the data needs an even number of bytes to be decoded.
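The even-length requirement can be seen with a truncated sample. This sketch cuts the data in the middle of a character, shows that a strict decode fails, and then uses an incremental decoder from the standard `codecs` module, which holds back the dangling byte until the rest arrives. The sample bytes are a shortened version of the output in the question.

```python
import codecs

# BOM + 'FLOW' + the first byte of a tab character: 11 bytes, an odd length
data = b'\xff\xfeF\x00L\x00O\x00W\x00\t'

try:
    data.decode('utf-16')
    strict_ok = True
except UnicodeDecodeError:
    # the last byte is half of a 2-byte code unit
    strict_ok = False

# An incremental decoder buffers the incomplete trailing byte instead of failing
decoder = codecs.getincrementaldecoder('utf-16')()
text = decoder.decode(data)                  # decodes everything up to the dangling byte
rest = decoder.decode(b'\x00', final=True)   # the missing byte completes the tab

print(strict_ok, repr(text), repr(rest))
```

When a whole file is read in one go this is not needed; it only matters when the bytes arrive in chunks or, as here, when a pasted snippet was cut off.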
