Django encoding error when reading from a CSV

Question

When I try to run:

import csv

with open('data.csv', 'rU') as csvfile:
  reader = csv.DictReader(csvfile)
  for row in reader:
    pgd = Player.objects.get_or_create(
      player_name=row['Player'],
      team=row['Team'], 
      position=row['Position']
    )

Most of my data gets created in the database, except for one particular row. When my script reaches the row, I receive the error:

ProgrammingError: You must not use 8-bit bytestrings unless you use a
text_factory that can interpret 8-bit bytestrings (like text_factory = str). 
It is highly recommended that you instead just switch your application to Unicode strings.

The particular row in the CSV that causes this error is:

>>> row
{'FR\xed\x8aD\xed\x8aRIC.ST-DENIS', 'BOS', 'G'}

I've looked at other Stack Overflow threads with the same or similar issues, but most aren't specific to using SQLite with Django. Any advice?

If it matters, I'm running the script by going into the Django shell by calling python manage.py shell, and copy-pasting it in, as opposed to just calling the script from the command line.

This is the stacktrace I get:

Traceback (most recent call last):
  File "<console>", line 4, in <module>
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/csv.py", line 108, in next
    row = self.reader.next()
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/codecs.py", line 302, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xcc in position 1674: invalid continuation byte

EDIT: I decided to just manually import this entry into my database, rather than try to read it from my CSV, based on Alastair McCormack's feedback

Python 2.x, but this is a new project so if switching to 3.x will make my life easier, I will do so.
– Konrad, Commented Sep 30, 2017 at 15:03

Alastair McCormack · Accepted Answer · 2017-09-30 15:17:58Z

1

I suspect you're using Python 2, where reading from a file opened with open() returns str objects, which are simply byte strings.

The error is telling you that you need to decode your text to Unicode strings before use.

The simplest method is to decode each cell:

with open('data.csv', 'r') as csvfile: # 'U' means Universal line mode and is not necessary
  reader = csv.DictReader(csvfile)
  for row in reader:
    # each cell is a byte string, so decode it to Unicode before saving
    pgd = Player.objects.get_or_create(
      player_name=row['Player'].decode('utf-8'),
      team=row['Team'].decode('utf-8'),
      position=row['Position'].decode('utf-8')
    )

That'll work, but it's ugly to add decodes everywhere and it won't work in Python 3. Python 3 improves things by opening files in text mode and returning Python 3 strings, which are the equivalent of Unicode strings in Python 2.
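As a minimal sketch of the Python 3 behaviour, assuming the file really is UTF-8: the built-in open() takes an encoding argument and csv.DictReader then yields already-decoded strings. The read_players helper and the sample file contents are hypothetical, made up for illustration, and the Django get_or_create call is left out so the snippet runs on its own.

```python
import csv
import os
import tempfile


def read_players(path, encoding="utf-8"):
    """Return the CSV rows as dicts of text strings (hypothetical helper)."""
    # newline="" is what the csv module documentation recommends for file objects
    with open(path, "r", encoding=encoding, newline="") as csvfile:
        return list(csv.DictReader(csvfile))


# Write a small UTF-8 sample file so the sketch is self-contained.
# "\u00c9" is the letter E with an acute accent.
sample = "Player,Team,Position\nFR\u00c9D\u00c9RIC.ST-DENIS,BOS,G\n"
with tempfile.NamedTemporaryFile("wb", suffix=".csv", delete=False) as tmp:
    tmp.write(sample.encode("utf-8"))

rows = read_players(tmp.name)
os.remove(tmp.name)

# Every value is already a text string; no per-cell .decode() is needed.
print(type(rows[0]["Player"]).__name__, rows[0]["Team"], rows[0]["Position"])
```

If the file is not actually UTF-8, this raises the same UnicodeDecodeError seen in the question, so the encoding argument has to match how the file was written.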

To get the same functionality in Python 2, use the io module. This gives you an open() function which has an encoding option. Annoyingly, the Python 2.x csv module is broken with Unicode, so you need to install a backported version:

pip install backports.csv

To tidy your code and future proof it, do:

import io
from backports import csv 

with io.open('data.csv', 'r', encoding='utf-8') as csvfile:
  reader = csv.DictReader(csvfile)
  for row in reader:
    # now every row is automatically decoded from UTF-8
    pgd = Player.objects.get_or_create(
      player_name=row['Player'],
      team=row['Team'], 
      position=row['Position']
    )

edited Sep 30, 2017 at 15:17

answered Sep 30, 2017 at 10:27

Alastair McCormack


7 Comments

Konrad Over a year ago

When I added decode, I get this error: UnicodeDecodeError: 'utf8' codec can't decode byte 0xcc in position 2: invalid continuation byte. I'm going to try your backports idea.

Konrad Over a year ago

Using backports didn't work either. It gave me the error UnicodeDecodeError: 'utf8' codec can't decode byte 0xcc in position 1674: invalid continuation byte on the same troublesome record. I also had to use from io import open

Alastair McCormack Over a year ago

Ah, I assumed the CSV was UTF-8 encoded. What encoding is the CSV?

Konrad Over a year ago

Is there an easy way to check? Unfortunately I'm not the one who created this spreadsheet.

Alastair McCormack Over a year ago

Based on the output from your question, it looks like the person who made the CSV mojibaked it - it doesn't seem to represent FRÉDÉRIC.ST-DENIS. You can try using windows-1252 instead of utf-8 but I think you'll end up with FRíŠDíŠRIC.ST-DENIS in your database.
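One rough way to check, sketched here in Python 3, is to decode the suspect bytes with a few candidate encodings and look at what comes out. The byte string is copied from the row shown in the question; the try_decodings helper and the list of candidates are assumptions made for illustration, not a reliable detector.

```python
# The byte string as it appears in the question's repr of the row.
raw = b"FR\xed\x8aD\xed\x8aRIC.ST-DENIS"


def try_decodings(data, candidates=("utf-8", "windows-1252", "latin-1")):
    """Map each candidate encoding to the decoded text, or None on failure."""
    results = {}
    for name in candidates:
        try:
            results[name] = data.decode(name)
        except UnicodeDecodeError:
            results[name] = None
    return results


decoded = try_decodings(raw)
for name, text in decoded.items():
    # ascii() keeps the output printable on any terminal
    print(name, ascii(text))
```

A decode that succeeds is not proof of the right encoding: windows-1252 accepts these bytes but produces the mojibake described above, which suggests the data was already damaged before it reached the CSV.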

Neeraj Kumar · Accepted Answer · 2017-09-30 10:08:07Z

0

Encode the player name as UTF-8 using .encode('utf-8'):

import csv

with open('data.csv', 'rU') as csvfile:
  reader = csv.DictReader(csvfile)
  for row in reader:
    pgd = Player.objects.get_or_create(
      player_name=row['Player'].encode('utf-8'),
      team=row['Team'], 
      position=row['Position']
    )

answered Sep 30, 2017 at 10:08

Neeraj Kumar

2 Comments

Konrad Over a year ago

When I added encode, I get the error: UnicodeDecodeError: 'ascii' codec can't decode byte 0xcc in position 2: ordinal not in range(128).

Alastair McCormack Over a year ago

That's because the file is already 8bit encoded. .encode() doesn't make sense here
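To see where the 'ascii' codec in that error comes from: in Python 2, calling .encode() on a byte string first decodes it implicitly with the default ASCII codec, and that hidden step is what fails. The sketch below spells the two steps out in Python 3, where they have to be written explicitly; the byte string is the one from the question's row.

```python
raw = b"FR\xed\x8aD\xed\x8aRIC.ST-DENIS"  # bytes taken from the question's row

try:
    # Python 2 performed the .decode("ascii") step implicitly
    # before applying .encode("utf-8") to a byte string.
    raw.decode("ascii").encode("utf-8")
    error = None
except UnicodeDecodeError as exc:
    error = exc

print(error)
```

The failure happens at the first non-ASCII byte, before any encoding takes place, which is why .encode() cannot help with data that is already bytes.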

Chidananda Nayak · Accepted Answer · 2019-05-28 10:32:20Z

0

In Django, decode with latin-1: csv.DictReader(io.StringIO(csv_file.read().decode('latin-1'))). Latin-1 maps every byte value to a character, so it absorbs all the special characters and avoids the decoding exceptions you get with utf-8.
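A minimal Python 3 sketch of that one-liner, assuming csv_file is a binary file-like object (an io.BytesIO with invented contents stands in for it here):

```python
import csv
import io

# Stand-in for the binary file object; the bytes are invented for the example
# and are encoded as latin-1 (0xC9 is E with an acute accent).
csv_file = io.BytesIO(b"Player,Team,Position\nFR\xc9D\xc9RIC.ST-DENIS,BOS,G\n")

# latin-1 assigns a code point to every byte value, so this decode cannot raise.
text = csv_file.read().decode("latin-1")
rows = list(csv.DictReader(io.StringIO(text)))

# Bytes that were never latin-1 also decode without error, just to the wrong text.
silently_wrong = b"\xed\x8a".decode("latin-1")

print(len(rows), ascii(rows[0]["Player"]), ascii(silently_wrong))
```

The trade-off is that latin-1 hides encoding problems rather than fixing them: the names only come out right if the file was actually written as latin-1.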

answered May 28, 2019 at 10:32

Chidananda Nayak
