I've been struggling with this, and have read numerous threads, but I can't seem to get this working. I need to save a UTF-8 CSV file.

Firstly, here's my super-simple approach:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import csv
import sys
import codecs

f = codecs.open("output.csv", "w", "utf-8-sig")
writer = csv.writer(f, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
cells = ["hello".encode("utf-8"), "nǐ hǎo".encode("utf-8"), "你好".encode("utf-8")]
writer.writerow(cells)

That results in an error:

Traceback (most recent call last):
  File "./makesimplecsv.py", line 10, in <module>
    cells = ["hello".encode("utf-8"), "nǐ hǎo".encode("utf-8"), "你好".encode("utf-8")]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc7 in position 1: ordinal not in range(128)

I've also tried using the UnicodeWriter class that's listed in the Python docs (https://docs.python.org/2/library/csv.html#examples ):

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import csv
import sys
import codecs
import cStringIO

class UnicodeWriter:
    """
    A CSV writer which will write rows to CSV file "f",
    which is encoded in the given encoding.
    """

    def __init__(self, f, dialect=csv.excel, encoding="utf-8", **kwds):
        # Redirect output to a queue
        self.queue = cStringIO.StringIO()
        self.writer = csv.writer(self.queue, dialect=dialect, **kwds)
        self.stream = f
        self.encoder = codecs.getincrementalencoder(encoding)()

    def writerow(self, row):
        self.writer.writerow([s.encode("utf-8") for s in row])
        # Fetch UTF-8 output from the queue ...
        data = self.queue.getvalue()
        data = data.decode("utf-8")
        # ... and reencode it into the target encoding
        data = self.encoder.encode(data)
        # write to the target stream
        self.stream.write(data)
        # empty queue
        self.queue.truncate(0)

    def writerows(self, rows):
        for row in rows:
            self.writerow(row)

f = codecs.open("output.csv", "w", "utf-8-sig")
writer = UnicodeWriter(f)
cells = ["hello".encode("utf-8"), "nǐ hǎo".encode("utf-8"), "你好".encode("utf-8")]
writer.writerow(cells)

That results in the same error:

Traceback (most recent call last):
  File "./makesimplecsvwithunicodewriter.sh", line 40, in <module>
    cells = ["hello".encode("utf-8"), "nǐ hǎo".encode("utf-8"), "你好".encode("utf-8")]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc7 in position 1: ordinal not in range(128)

I thought I'd gone through the checklist of things I've found in other similar questions:

  • My file has an encoding statement.
  • I'm opening the file for writing with UTF-8.
  • I'm encoding the individual strings in UTF-8 before I pass them to the CSV writer.
  • I've tried with and without adding a UTF-8 BOM, but that doesn't seem to make any difference, or indeed be critical, from what I've read.

Any ideas on what I'm doing wrong?

2 Answers

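In Python 2, a plain "..." literal is already an encoded byte string. Calling .encode("utf-8") on it makes Python first decode the bytes implicitly with the ASCII codec, and that implicit decode is what raises the UnicodeDecodeError. There is little point in encoding here anyway, because UnicodeWriter expects Unicode objects.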

Don't encode, decode:

cells = ["hello".decode("utf-8"), "nǐ hǎo".decode("utf-8"), "你好".decode("utf-8")]

or use u'...' unicode string literals:

cells = [u"hello", u"nǐ hǎo", u"你好"]

You cannot use a codecs.open() file object with the Python 2 csv module. Either use the UnicodeWriter approach with a regular file object and pass in Unicode objects, or encode your cells to byte strings yourself and pass them to a csv.writer() object directly, again with a regular file object. The second option is what UnicodeWriter does internally.
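
The error message in the question can be reproduced in isolation. The sketch below performs explicitly the ASCII decode that Python 2 attempts implicitly when .encode() is called on a byte string. The variable names are illustrative, and the bytes are the UTF-8 encoding of "nǐ hǎo":

```python
# -*- coding: utf-8 -*-

# UTF-8 bytes of "nǐ hǎo"; this is what a plain "..." literal holds in
# a Python 2 source file declared as UTF-8.
raw = b"n\xc7\x90 h\xc7\x8eo"

try:
    # Python 2 runs this step implicitly before it can encode a byte string.
    raw.decode("ascii")
    error = None
except UnicodeDecodeError as exc:
    error = exc

# Decoding with the codec the bytes were actually written in succeeds.
text = raw.decode("utf-8")
```

The failure is reported at position 1, the first byte of "ǐ", which matches the traceback in the question.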

3 Comments

Thanks! I was able to get this working based on your feedback. I used the UnicodeWriter approach, and switched the encode() calls to decode(), and I used the standard open() function to get the file object for writing. I'm going to update the question with the solution that works for future reference.
@antun: If you feel the need then add your own solution as a new answer; the question should remain just a question.
OK, I'll add it as an answer.

UPDATE - SOLUTION

Thanks to the accepted answer I was able to get this working. Here is the full working example for future reference:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import csv
import sys
import codecs
import cStringIO

class UnicodeWriter:
    """
    A CSV writer which will write rows to CSV file "f",
    which is encoded in the given encoding.
    """

    def __init__(self, f, dialect=csv.excel, encoding="utf-8", **kwds):
        # Redirect output to a queue
        self.queue = cStringIO.StringIO()
        self.writer = csv.writer(self.queue, dialect=dialect, **kwds)
        self.stream = f
        self.encoder = codecs.getincrementalencoder(encoding)()

    def writerow(self, row):
        self.writer.writerow([s.encode("utf-8") for s in row])
        # Fetch UTF-8 output from the queue ...
        data = self.queue.getvalue()
        data = data.decode("utf-8")
        # ... and reencode it into the target encoding
        data = self.encoder.encode(data)
        # write to the target stream
        self.stream.write(data)
        # empty queue
        self.queue.truncate(0)

    def writerows(self, rows):
        for row in rows:
            self.writerow(row)

# Python 2's csv module expects a file opened in binary mode
f = open("output.csv", "wb")

writer = UnicodeWriter(f)
cells = ["hello".decode("utf-8"), "nǐ hǎo".decode("utf-8"), "你好".decode("utf-8")]
writer.writerow(cells)
f.close()
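
For anyone reading this on Python 3, where the csv module works with text directly, none of the above machinery should be needed. A minimal sketch, assuming Python 3 and a hypothetical helper named write_row (not part of the original solution):

```python
import csv


def write_row(path, cells, encoding="utf-8-sig"):
    """Write one row of text cells to a CSV file at `path`.

    "utf-8-sig" prepends a BOM; pass "utf-8" to leave it out.
    """
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", encoding=encoding, newline="") as f:
        writer = csv.writer(f, delimiter=",", quotechar='"',
                            quoting=csv.QUOTE_MINIMAL)
        writer.writerow(cells)
```

It would be called as write_row("output.csv", ["hello", "nǐ hǎo", "你好"]), with the cells passed as ordinary strings and no encode() or decode() calls.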
