
I have referred to some posts related to Unicode errors but didn't find a solution to my problem. I am converting xlsx to csv from a workbook of 6 sheets, using the following code:

import csv
import os
import xlrd

def csv_from_excel(file_loc):

    # check that the file is readable
    print os.access(file_loc, os.R_OK)
    wb = xlrd.open_workbook(file_loc)
    print wb.nsheets

    sheet_names = wb.sheet_names()
    print sheet_names
    counter = 0

    while counter < wb.nsheets:
        try:
            sh = wb.sheet_by_name(sheet_names[counter])
            file_name = str(sheet_names[counter]) + '.csv'
            print file_name
            fh = open(file_name, 'wb')
            wr = csv.writer(fh, quoting=csv.QUOTE_ALL)

            for rownum in xrange(sh.nrows):
                wr.writerow(sh.row_values(rownum))

        except Exception as e:
            print str(e)

        finally:
            fh.close()
            counter += 1

I get an error on the 4th sheet:

'ascii' codec can't encode character u'\u2018' in position 0: ordinal not in range(128)

but position 0 is blank, and it had converted to csv up to the 33rd row.

I am unable to figure this out. CSV was an easy way to read the content and put it in my data structure.
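For reference, the "position" in a UnicodeEncodeError is the index inside the string being encoded, not a row or column number, so the offending cell value starts with the curly quote. A minimal reproduction of the error (standalone, no spreadsheet needed):

```python
# The 'position' in a UnicodeEncodeError is the index inside the string
# being encoded, so a string starting with U+2018 reports position 0.
try:
    u'\u2018Hello'.encode('ascii')
except UnicodeEncodeError as e:
    print(e)
```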

1 Answer


You'll need to manually encode Unicode values to bytes; for CSV, UTF-8 is usually fine:

for rownum in xrange(sh.nrows):
    wr.writerow([unicode(c).encode('utf8') for c in sh.row_values(rownum)])

Here I use unicode() for column data that is not text.

The character you encountered is U+2018 LEFT SINGLE QUOTATION MARK, which is just a fancy form of the ' single quote. Office software (spreadsheets, word processors, etc.) often auto-replaces straight single and double quotes with the 'fancy' versions. You could also just replace those with ASCII equivalents. You can do that with the Unidecode package:

from unidecode import unidecode

for rownum in xrange(sh.nrows):
    wr.writerow([unidecode(unicode(c)) for c in sh.row_values(rownum)])

Use this when non-ASCII codepoints are only used for quotes and dashes and other punctuation.


3 Comments

Thanks a lot @martijn-pieters. The first example, with direct encoding to UTF-8, seems to work. Is using Unidecode a fool-proof way? Why does this happen? Can't we explicitly declare the coding standard for the whole file?
@nij_wiz: The CSV module in Python 2 cannot handle Unicode; it was written well ahead of Unicode support in Python. This has been fixed in Python 3. Unidecode is a pragmatic method to ensure the data only uses ASCII codepoints, by replacing any non-ASCII text with ASCII equivalents where available. Whether this is fool-proof depends on your exact data.
@martijn-pieters: Yup, I am using 2.7, hence this problem. I am using a lot of third-party libraries which are yet to be ported to Python 3; I use 3.4 for networking and other jobs. Thanks for your valuable input.
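As the comments note, Python 3's csv module handles Unicode natively, so no per-cell encoding is needed there. A minimal standalone sketch (no xlrd; an in-memory buffer stands in for the output file, which you would normally open with open(name, 'w', encoding='utf-8', newline='')):

```python
import csv
import io

# Sample row containing the curly quotes that break the Python 2 csv writer.
rows = [[u'\u2018quoted\u2019', u'caf\xe9', 42]]

# In Python 3 the csv writer accepts str values directly; for a real file,
# open it in text mode with an explicit encoding and newline=''.
buf = io.StringIO()
wr = csv.writer(buf, quoting=csv.QUOTE_ALL)
for row in rows:
    wr.writerow(row)

print(buf.getvalue())
```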
