
I have an xlsx file that I need to convert to csv; I used the openpyxl module along with unicodecsv for this. My problem is that while writing some files I am getting junk characters in the output. Details below.

One of my files has the Unicode code point u'\xa0' in it, which corresponds to NO-BREAK SPACE, but when converted to csv the file shows Â instead of the space. When I print the same data to the console from the Python GUI it prints perfectly, without any Â. What am I doing wrong here? Any help is appreciated.

Sample Code:

import unicodecsv
from openpyxl import load_workbook

xlsx_file=load_workbook('testfile.xlsx',use_iterators=True)
with open('test_utf.csv','wb') as open_file:
    csv_file=unicodecsv.writer(open_file)
    sheet=xlsx_file.get_active_sheet()
    for row in sheet.iter_rows():
        csv_file.writerow(cell.internal_value for cell in row)

P.S.: The data being written is Unicode.

  • Are you loading the csv back into Microsoft Excel? The handling of Unicode in csv files is a little wonky; Excel by default usually expects csv data to be latin1. Commented Dec 9, 2013 at 10:01
  • Yes, I just tried opening it in Notepad++ and it shows a space there, so does that mean the Â was a result of MS Excel's internal decoding? Commented Dec 9, 2013 at 10:06
  • Yes. It is normally Windows codepage 1252. Before I answer, is this for your personal use or does it need to work for other people? Commented Dec 9, 2013 at 10:08
  • @Tim: Excel expects CSV data to be encoded in the currently configured codepage. That's a real pain. Commented Dec 9, 2013 at 10:09
  • @MartijnPieters I know, it has bitten me before. Commented Dec 9, 2013 at 10:10
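
A minimal sketch of the mechanism described in the comments above, assuming Python 2 to match the code in the question: the UTF-8 encoding of U+00A0 is two bytes, and decoding those bytes as Windows-1252 produces Â followed by a no-break space.

# U+00A0 encoded as UTF-8 becomes two bytes; cp1252 decodes them as two characters
nbsp_utf8 = u'\xa0'.encode('utf-8')       # the two bytes '\xc2\xa0'
print(repr(nbsp_utf8.decode('cp1252')))   # u'\xc2\xa0' -- rendered as "Â" plus a no-break space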

2 Answers


Okay, so what is going on is that Excel likes to assume that you are using the currently configured codepage. You have a couple of options:

  • Write your data in that codepage. This requires, however, that you know which one your users will be using.

  • Load the csv file using the "import data" menu option. If you are relying on your users to do this, don't. Most people will not be willing to do this.

  • Use a different program that will accept Unicode in csv by default, such as LibreOffice.

  • Add a BOM to the beginning of the file to get Excel to recognise UTF-8. This may break in other programs.

Since this is for your personal use, if you are only ever going to use Excel, then adding a byte order mark at the beginning is probably the easiest solution.
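
A minimal sketch of that last option, adapted from the code in the question (same filenames and the same old openpyxl iterator API used there); writing the BOM before handing the file to the csv writer is enough for Excel to pick up UTF-8:

import codecs
import unicodecsv
from openpyxl import load_workbook

xlsx_file = load_workbook('testfile.xlsx', use_iterators=True)
with open('test_utf.csv', 'wb') as open_file:
    # Write the UTF-8 BOM first so Excel detects the encoding
    open_file.write(codecs.BOM_UTF8)
    csv_file = unicodecsv.writer(open_file, encoding='utf-8')
    sheet = xlsx_file.get_active_sheet()
    for row in sheet.iter_rows():
        csv_file.writerow([cell.internal_value for cell in row])

If you went with the first option instead, passing encoding='cp1252' to unicodecsv.writer (and skipping the BOM) would target the usual Windows codepage, at the cost of only being able to write characters that codepage covers.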



Microsoft likes byte-order marks in its text files. Even though a BOM doesn't make sense with UTF-8, it is used as a signature to let Excel know the file is encoded in UTF-8.

Make sure to generate your .csv as UTF-8 with BOM. I created the following using Notepad++:

English,Chinese
American,美国人
Chinese,中国人

The result saved with BOM:

[Screenshot: result with BOM]

The result without BOM:

[Screenshot: result without BOM]
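
The same file can also be produced from Python rather than Notepad++; a minimal sketch using the utf-8-sig codec, which writes the BOM automatically (the output filename here is just an example):

# -*- coding: utf-8 -*-
import codecs

# The utf-8-sig stream writer emits the BOM before the first write
with codecs.open('sample_bom.csv', 'w', encoding='utf-8-sig') as f:
    f.write(u'English,Chinese\r\n')
    f.write(u'American,美国人\r\n')
    f.write(u'Chinese,中国人\r\n')

# Quick check that the file really starts with the UTF-8 BOM
with open('sample_bom.csv', 'rb') as f:
    print(f.read(3) == codecs.BOM_UTF8)   # True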

4 Comments

  • The problem with that is that a lot of (other) programs will choke on the BOM.
  • @Tim: Not so much choke; the BOM is a valid zero-width space character. They may not particularly like that character if you are doing string matching against the first column without stripping.
  • @MartijnPieters Valid or not, I have seen a lot of programs that don't handle them correctly at all. Sometimes you even get them displayed erroneously as glyphs. Often these programs have mostly correct handling of UTF-16 with a BOM or UTF-8 without a BOM. I would prefer bug-free support everywhere obviously, but the last I checked (about a year ago admittedly) it still seems to be a common issue, and worth mentioning.
  • I haven't personally come across such programs, but I am not a fan of the way Microsoft has adopted the ZERO WIDTH NO-BREAK SPACE as a UTF-8 signature, when UTF-8 needs no byte order mark at all.
