Saving UTF-8 CSV with Python

Question

I've been struggling with this, and have read numerous threads, but I can't seem to get this working. I need to save a UTF-8 CSV file.

Firstly, here's my super-simple approach:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import csv
import sys
import codecs

f = codecs.open("output.csv", "w", "utf-8-sig")
writer = csv.writer(f, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
cells = ["hello".encode("utf-8"), "nǐ hǎo".encode("utf-8"), "你好".encode("utf-8")]
writer.writerow(cells)

That results in an error:

Traceback (most recent call last):
  File "./makesimplecsv.py", line 10, in <module>
    cells = ["hello".encode("utf-8"), "nǐ hǎo".encode("utf-8"), "你好".encode("utf-8")]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc7 in position 1: ordinal not in range(128)

I've also tried using the UnicodeWriter class that's listed in the Python docs (https://docs.python.org/2/library/csv.html#examples):

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import csv
import sys
import codecs
import cStringIO

class UnicodeWriter:
    """
    A CSV writer which will write rows to CSV file "f",
    which is encoded in the given encoding.
    """

    def __init__(self, f, dialect=csv.excel, encoding="utf-8", **kwds):
        # Redirect output to a queue
        self.queue = cStringIO.StringIO()
        self.writer = csv.writer(self.queue, dialect=dialect, **kwds)
        self.stream = f
        self.encoder = codecs.getincrementalencoder(encoding)()

    def writerow(self, row):
        self.writer.writerow([s.encode("utf-8") for s in row])
        # Fetch UTF-8 output from the queue ...
        data = self.queue.getvalue()
        data = data.decode("utf-8")
        # ... and reencode it into the target encoding
        data = self.encoder.encode(data)
        # write to the target stream
        self.stream.write(data)
        # empty queue
        self.queue.truncate(0)

    def writerows(self, rows):
        for row in rows:
            self.writerow(row)

f = codecs.open("output.csv", "w", "utf-8-sig")
writer = UnicodeWriter(f)
cells = ["hello".encode("utf-8"), "nǐ hǎo".encode("utf-8"), "你好".encode("utf-8")]
writer.writerow(cells)

That results in the same error:

Traceback (most recent call last):
  File "./makesimplecsvwithunicodewriter.sh", line 40, in <module>
    cells = ["hello".encode("utf-8"), "nǐ hǎo".encode("utf-8"), "你好".encode("utf-8")]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc7 in position 1: ordinal not in range(128)

I thought I'd gone through the checklist of things I've found in other similar questions:

My file has an encoding statement.
I'm opening the file for writing with UTF-8.
I'm encoding the individual strings in UTF-8 before I pass them to the CSV writer.
I've tried with and without adding a UTF-8 BOM, but that doesn't seem to make any difference, or indeed be critical, from what I've read.

Any ideas on what I'm doing wrong?

Martijn Pieters · Accepted Answer · 2014-07-25 16:22:56Z

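You are calling .encode() on values that are already encoded byte strings. In Python 2, calling .encode() on a byte string first decodes it implicitly with the default ASCII codec, which is why an encode call raises a UnicodeDecodeError. There is also little point in encoding here, since the UnicodeWriter expects Unicode objects.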
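
To see where the decode error comes from, the implicit step can be reproduced explicitly. This is a minimal sketch rather than part of the original code: the names raw and message are illustrative, and the explicit decode("ascii") call stands in for what Python 2 does behind the scenes when .encode() is called on a byte string.

```python
# -*- coding: utf-8 -*-

# The UTF-8 bytes that a plain "nǐ hǎo" literal holds in a Python 2 source file.
raw = u"nǐ hǎo".encode("utf-8")

message = None
try:
    # Python 2 performs this ASCII decode implicitly before encoding a byte string.
    raw.decode("ascii")
except UnicodeDecodeError as exc:
    message = str(exc)

print(message)
```

The printed message names byte 0xc7 at position 1, the same byte and position reported in the traceback from the question.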

Don't encode, decode:

cells = ["hello".decode("utf-8"), "nǐ hǎo".decode("utf-8"), "你好".decode("utf-8")]

or use u'...' unicode string literals:

cells = [u"hello", u"nǐ hǎo", u"你好"]

You cannot use a codecs.open() file object with the Python 2 csv module. You have two options, both with a regular file object: use the UnicodeWriter approach and pass in Unicode objects, or encode your cells to byte strings yourself and pass them to the csv.writer() object directly. The second option is what the UnicodeWriter does internally.
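
The second option might look like the sketch below. The function name write_utf8_row and the PY2 flag are hypothetical, introduced only for illustration. The Python 3 branch goes beyond what the answer covers: it assumes that on Python 3 the csv module works with text directly, so the file is opened in text mode with an explicit encoding instead of encoding each cell.

```python
# -*- coding: utf-8 -*-
import csv
import sys

# Hypothetical flag: True when running under Python 2.
PY2 = sys.version_info[0] == 2


def write_utf8_row(path, cells):
    """Write one row of Unicode cells to `path` as UTF-8 CSV."""
    if PY2:
        # Python 2: csv.writer wants byte strings and a regular binary file.
        with open(path, "wb") as f:
            csv.writer(f).writerow([c.encode("utf-8") for c in cells])
    else:
        # Python 3: csv.writer wants text; the file object does the encoding.
        with open(path, "w", encoding="utf-8", newline="") as f:
            csv.writer(f).writerow(cells)


write_utf8_row("output.csv", [u"hello", u"nǐ hǎo", u"你好"])
```

Either way, the cells start out as Unicode objects and are encoded exactly once, on their way into the file.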

edited Jul 25, 2014 at 16:22

answered Jul 25, 2014 at 15:28

3 Comments

antun Over a year ago

Thanks! I was able to get this working based on your feedback. I used the UnicodeWriter approach, and switched the encode() calls to decode(), and I used the standard open() function to get the file object for writing. I'm going to update the question with the solution that works for future reference.

Martijn Pieters Over a year ago

@antun: If you feel the need then add your own solution as a new answer; the question should remain just a question.

antun Over a year ago

OK, I'll add it as an answer.

antun · Answer · 2014-07-25 16:28:04Z

UPDATE - SOLUTION

Thanks to the accepted answer I was able to get this working. Here is the full working example for future reference:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import csv
import sys
import codecs
import cStringIO

class UnicodeWriter:
    """
    A CSV writer which will write rows to CSV file "f",
    which is encoded in the given encoding.
    """

    def __init__(self, f, dialect=csv.excel, encoding="utf-8", **kwds):
        # Redirect output to a queue
        self.queue = cStringIO.StringIO()
        self.writer = csv.writer(self.queue, dialect=dialect, **kwds)
        self.stream = f
        self.encoder = codecs.getincrementalencoder(encoding)()

    def writerow(self, row):
        self.writer.writerow([s.encode("utf-8") for s in row])
        # Fetch UTF-8 output from the queue ...
        data = self.queue.getvalue()
        data = data.decode("utf-8")
        # ... and reencode it into the target encoding
        data = self.encoder.encode(data)
        # write to the target stream
        self.stream.write(data)
        # empty queue
        self.queue.truncate(0)

    def writerows(self, rows):
        for row in rows:
            self.writerow(row)

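# Open a regular file object in binary mode, as the Python 2 csv docs advise,
# and let the with block close it once the row is written.
with open("output.csv", "wb") as f:
    writer = UnicodeWriter(f)
    # Decode the UTF-8 source literals into Unicode objects before writing.
    cells = ["hello".decode("utf-8"), "nǐ hǎo".decode("utf-8"), "你好".decode("utf-8")]
    writer.writerow(cells)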
