
I'm trying to use the following code (within web2py) to read a csv file and convert it into a json object:

import csv
import json
from StringIO import StringIO  # needed for StringIO(file_contents) below

originalfilename, file_stream = db.tablename.file.retrieve(info.file)
file_contents = file_stream.read()

csv_reader = csv.DictReader(StringIO(file_contents))
json = json.dumps([x for x in csv_reader])

This produces the following error:

'utf8' codec can't decode byte 0xa0 in position 1: invalid start byte

Apparently, there is a problem handling the spaces in the .csv file. The problem appears to stem from the json.dumps() line. The traceback from that point on:

Traceback (most recent call last):
  File ".../web2py/gluon/restricted.py", line 212, in restricted
    exec ccode in environment
  File ".../controllers/default.py", line 2345, in <module>
  File ".../web2py/gluon/globals.py", line 194, in <lambda>
    self._caller = lambda f: f()
  File ".../web2py/gluon/tools.py", line 3021, in f
    return action(*a, **b)
  File ".../controllers/default.py", line 697, in generate_vis
    request.vars.json = json.dumps(list(csv_reader))
  File "/usr/local/lib/python2.7/json/__init__.py", line 243, in dumps
    return _default_encoder.encode(obj)
  File "/usr/local/lib/python2.7/json/encoder.py", line 207, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "/usr/local/lib/python2.7/json/encoder.py", line 270, in iterencode
    return _iterencode(o, 0)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xa0 in position 1: invalid start byte

Any suggestions regarding how to resolve this, or another way to get a csv file (which contains a header; using StringIO) into a json object that won't produce similar complications? Thank you.

  • Is this Python 2 or 3? Please do include the full traceback. Commented Jul 7, 2013 at 17:34
  • You don't need to use a list comprehension where a simple list() call would do: json.dumps(list(csv_reader)) would be more efficient. Commented Jul 7, 2013 at 17:35
  • Last but not least, you'll need to share how you read the file with us. What web framework is this? Commented Jul 7, 2013 at 17:36
  • Please clarify exactly which line raises the error. Commented Jul 7, 2013 at 17:40
  • python pandas offers a very convenient way of handling csv files: pandas.pydata.org/pandas-docs/stable/generated/…, if that's of any help Commented Jul 7, 2013 at 17:44

3 Answers


The csv module (under Python 2) is purely byte-based; all strings you get out of it are bytes. JSON, however, is Unicode character-based, so there is an implicit conversion when you try to write the bytes you got from CSV into JSON. Python guessed UTF-8 for this, but your CSV file wasn't UTF-8; it was probably Windows code page 1252 (Western European; similar to ISO-8859-1, but not identical).

A quick fix would be to transcode your input (file_contents = file_contents.decode('windows-1252').encode('utf-8')), but you probably don't want to rely on having guessed the right source encoding.
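As a minimal sketch of that transcoding step (treating windows-1252 as the source encoding is an assumption about this particular file, and the sample bytes are made up):

```python
# -*- coding: utf-8 -*-
# Transcode the raw CSV bytes from the (assumed) source encoding to UTF-8,
# so that json.dumps' implicit UTF-8 decode no longer chokes on 0xa0.
raw = b'caf\xe9 \xa0100'  # "cafe" with an accent, plus a cp1252 non-breaking space

fixed = raw.decode('windows-1252').encode('utf-8')

# The single offending 0xa0 byte is now the valid two-byte
# UTF-8 sequence C2 A0 (U+00A0, no-break space).
assert b'\xc2\xa0' in fixed
```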

Best would be to explicitly decode your strings at the point of reading them from CSV. Then JSON will be able to cope with them OK. Unfortunately, csv doesn't have built-in decoding (at least in this Python version), but you can do it manually:

class UnicodeDictReader(csv.DictReader):
    """DictReader that decodes every key and value to unicode (Python 2)."""
    def __init__(self, f, encoding, *args, **kwargs):
        csv.DictReader.__init__(self, f, *args, **kwargs)
        self.encoding = encoding
    def next(self):
        # Decode each byte-string key and value as the row is read
        return {
            k.decode(self.encoding): v.decode(self.encoding)
            for (k, v) in csv.DictReader.next(self).items()
        }

csv_reader = UnicodeDictReader(StringIO(file_contents), 'windows-1252')
json_output = json.dumps(list(csv_reader))

it's not known in advance what sort of encoding will come up

Well, that's more of a problem, since it's impossible to guess accurately what encoding a file is in. You would either have to specify a particular encoding, or give the user a way to signal what the encoding is, if you want to support non-ASCII characters properly.
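Incidentally, the specific byte from the traceback shows why the encoding choice matters: 0xa0 is never a valid first byte of a UTF-8 sequence, but under windows-1252 it's just a non-breaking space. A quick sketch:

```python
raw = b'\xa0'  # the byte from the traceback

# As UTF-8 this byte is invalid: 0xa0 may only appear as a
# continuation byte, never at the start of a sequence.
try:
    raw.decode('utf-8')
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False
assert not decoded_ok

# Under windows-1252 the same byte is simply U+00A0 (no-break space).
assert raw.decode('windows-1252') == u'\xa0'
```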



Try replacing your final line with

json = json.dumps([x.encode('utf-8') for x in csv_reader])

1 Comment

The specific character that's causing the issue in this particular case is '\xa0'; encode('utf-8') produces an error when encountering it.

Running unidecode over the file contents seems to do the trick:

from isounidecode import unidecode

...

file_contents = unidecode(file_stream.read())

...

Thanks, everyone!
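If pulling in an external package isn't an option, a rough stdlib-only approximation of the same idea is possible with unicodedata; the windows-1252 default and the ascii_fold helper name here are my own assumptions, and this is just as lossy as unidecode:

```python
import unicodedata

def ascii_fold(raw, encoding='windows-1252'):
    # Decode with an assumed source encoding, then decompose accented
    # characters (NFKD) and drop whatever still isn't ASCII.
    text = raw.decode(encoding)
    return unicodedata.normalize('NFKD', text).encode('ascii', 'ignore')

# Accented "cafe" plus a non-breaking space folds down to plain ASCII
# (NFKD turns U+00A0 into a regular space):
assert ascii_fold(b'caf\xe9\xa0') == b'cafe '
```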

2 Comments

That's replacing all non-ASCII characters with mangled best-fit ASCII versions - are you sure you want to do that?
You make a good point. It may not be a universal solution, but it works for my purposes, and it seems to do better than some other options at handling a multitude of cases without producing an error (as opposed to encode(), which gets tripped up by '\xa0', for example).
