When I try to run:

import csv

with open('data.csv', 'rU') as csvfile:
  reader = csv.DictReader(csvfile)
  for row in reader:
    pgd = Player.objects.get_or_create(
      player_name=row['Player'],
      team=row['Team'], 
      position=row['Position']
    )

Most of my data gets created in the database, except for one particular row. When my script reaches the row, I receive the error:

ProgrammingError: You must not use 8-bit bytestrings unless you use a
text_factory that can interpret 8-bit bytestrings (like text_factory = str). 
It is highly recommended that you instead just switch your application to Unicode strings.

The particular row in the CSV that causes this error is:

>>> row
{'FR\xed\x8aD\xed\x8aRIC.ST-DENIS', 'BOS', 'G'}

I've looked at other Stack Overflow threads with the same or similar issues, but most aren't specific to using SQLite with Django. Any advice?

If it matters, I'm running the script by opening the Django shell with python manage.py shell and pasting it in, rather than running the script from the command line.

This is the stacktrace I get:

Traceback (most recent call last):
  File "<console>", line 4, in <module>
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/csv.py", line 108, in next
    row = self.reader.next()
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/codecs.py", line 302, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xcc in position 1674: invalid continuation byte

EDIT: I decided to just manually import this entry into my database, rather than try to read it from my CSV, based on Alastair McCormack's feedback

2
  • Python 2.x or 3.x? Commented Sep 30, 2017 at 10:12
  • Python 2.x, but this is a new project so if switching to 3.x will make my life easier, I will do so. Commented Sep 30, 2017 at 15:03

3 Answers

I suspect you're using Python 2, where open() returns str objects, which are simply byte strings.

The error is telling you that you need to decode your text to Unicode strings before use.

The simplest method is to decode each cell:

import csv

with open('data.csv', 'r') as csvfile:  # 'U' (universal newlines mode) is not necessary
  reader = csv.DictReader(csvfile)
  for row in reader:
    # Python 2: each cell is a byte string, so decode it to unicode explicitly
    pgd = Player.objects.get_or_create(
      player_name=row['Player'].decode('utf-8'),
      team=row['Team'].decode('utf-8'),
      position=row['Position'].decode('utf-8')
    )

That'll work, but it's ugly to add decodes everywhere and it won't work in Python 3. Python 3 improves things by opening files in text mode and returning Python 3 strings, which are the equivalent of Unicode strings in Python 2.
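For comparison, here is a minimal Python 3 sketch of the text-mode behaviour described above. The helper name read_rows and the temporary sample file are made up for illustration, and it assumes the file really is UTF-8 (which, per the comments on this answer, may not hold for the asker's file):

```python
import csv
import os
import tempfile


def read_rows(path, encoding="utf-8"):
    """Return every CSV row as a dict of already-decoded text (Python 3)."""
    # The csv module documentation recommends newline="" for files given to a reader.
    with open(path, "r", encoding=encoding, newline="") as csvfile:
        return list(csv.DictReader(csvfile))


# Build a tiny UTF-8 sample file so the sketch runs on its own (hypothetical data).
sample = "Player,Team,Position\nFR\u00c9D\u00c9RIC.ST-DENIS,BOS,G\n"
with tempfile.NamedTemporaryFile(
    "w", encoding="utf-8", suffix=".csv", delete=False, newline=""
) as tmp:
    tmp.write(sample)
    sample_path = tmp.name

rows = read_rows(sample_path)
os.remove(sample_path)
print(rows[0]["Player"])
```

No per-cell .decode() calls are needed here because open() decodes the whole file as it is read.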

To get the same functionality in Python 2, use the io module. This gives you an open() function which has an encoding option. Annoyingly, the Python 2.x csv module does not handle Unicode input, so you need to install a backported version:

pip install backports.csv

To tidy your code and future-proof it, do:

import io
from backports import csv 

with io.open('data.csv', 'r', encoding='utf-8') as csvfile:
  reader = csv.DictReader(csvfile)
  for row in reader:
    # now every row is automatically decoded from UTF-8
    pgd = Player.objects.get_or_create(
      player_name=row['Player'],
      team=row['Team'], 
      position=row['Position']
    )

7 Comments

When I added decode, I get this error: UnicodeDecodeError: 'utf8' codec can't decode byte 0xcc in position 2: invalid continuation byte. I'm going to try your backports idea.
Using backports didn't work either. It gave me the error UnicodeDecodeError: 'utf8' codec can't decode byte 0xcc in position 1674: invalid continuation byte on the same troublesome record. I also had to use from io import open
Ah, I assumed the CSV was UTF-8 encoded. What encoding is the CSV?
Is there an easy way to check? Unfortunately I'm not the one who created this spreadsheet.
Based on the output from your question, it looks like the person who made the CSV mojibaked it - it doesn't seem to represent FRÉDÉRIC.ST-DENIS. You can try using windows-1252 instead of utf-8 but I think you'll end up with FRíŠDíŠRIC.ST-DENIS in your database.
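On the "is there an easy way to check?" question: one rough approach is to try a few candidate encodings on the raw bytes and look at what comes out. The sketch below is written for Python 3; the helper name try_encodings and the candidate list are assumptions for illustration, and the sample bytes are copied from the row printed in the question. A clean decode does not prove the encoding is right (latin-1 never fails, for example), so the output still has to be judged by eye:

```python
def try_encodings(raw, candidates=("utf-8", "cp1252", "mac_roman", "latin-1")):
    """Map each candidate encoding to the decoded text, or None if decoding fails."""
    results = {}
    for name in candidates:
        try:
            results[name] = raw.decode(name)
        except UnicodeDecodeError:
            results[name] = None
    return results


# The bytes of the problem name exactly as printed in the question.
raw_name = b"FR\xed\x8aD\xed\x8aRIC.ST-DENIS"
for name, text in try_encodings(raw_name).items():
    print(name, "->", repr(text))
```

For these particular bytes, utf-8 fails and cp1252 gives the FRíŠDíŠRIC.ST-DENIS form mentioned in the comment above, which is consistent with the data having been damaged before it reached the CSV.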

Encode the player name as UTF-8 by calling .encode('utf-8') on it:

import csv

with open('data.csv', 'rU') as csvfile:
  reader = csv.DictReader(csvfile)
  for row in reader:
    pgd = Player.objects.get_or_create(
      player_name=row['Player'].encode('utf-8'),
      team=row['Team'], 
      position=row['Position']
    )

2 Comments

When I added encode, I get the error: UnicodeDecodeError: 'ascii' codec can't decode byte 0xcc in position 2: ordinal not in range(128).
That's because the file is already 8-bit encoded; .encode() doesn't make sense here.
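To spell out why the 'ascii' codec shows up in that error: in Python 2, calling .encode() on a byte string first decodes it implicitly with the default ASCII codec, and that hidden step is what fails. Python 3 bytes have no .encode() at all, so this sketch writes the hidden step explicitly; the sample bytes are taken from the row printed in the question:

```python
raw_name = b"FR\xed\x8aD\xed\x8aRIC.ST-DENIS"  # bytes as shown in the question

try:
    # Python 2 performs this ASCII decode implicitly when .encode()
    # is called on a byte string; here it is written out by hand.
    raw_name.decode("ascii").encode("utf-8")
    failure = None
except UnicodeDecodeError as exc:
    failure = exc

print(failure)
```

The exception is raised by the decode step, before any encoding to UTF-8 is attempted.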

In Django, decode with latin-1: csv.DictReader(io.StringIO(csv_file.read().decode('latin-1'))). Latin-1 accepts every byte value, so it handles all the special characters and avoids the decoding exceptions you get with utf-8.
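A minimal Python 3 sketch of that approach, with a bytes literal standing in for whatever csv_file.read() would return (the sample row is made up from the name shown in the question). Note the trade-off: latin-1 never raises, but if the bytes were not really latin-1 the stored text will contain the wrong characters:

```python
import csv
import io

# Stand-in for the bytes of an uploaded file, i.e. what csv_file.read() would return.
raw_csv = b"Player,Team,Position\nFR\xed\x8aD\xed\x8aRIC.ST-DENIS,BOS,G\n"

# latin-1 maps each byte to exactly one code point, so this decode cannot fail.
reader = csv.DictReader(io.StringIO(raw_csv.decode("latin-1")))
rows = list(reader)
print(repr(rows[0]["Player"]))
```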
