
I'm trying to use the following code (within web2py) to read a csv file and convert it into a json object:

import csv
import json
from StringIO import StringIO  # needed for StringIO(file_contents) below

originalfilename, file_stream = db.tablename.file.retrieve(info.file)
file_contents = file_stream.read()

csv_reader = csv.DictReader(StringIO(file_contents))
json = json.dumps([x for x in csv_reader])

This produces the following error:

'utf8' codec can't decode byte 0xa0 in position 1: invalid start byte

Apparently, there is a problem handling the spaces in the .csv file. The problem appears to stem from the json.dumps() line. The traceback from that point on:

Traceback (most recent call last):
  File ".../web2py/gluon/restricted.py", line 212, in restricted
    exec ccode in environment
  File ".../controllers/default.py", line 2345, in <module>
  File ".../web2py/gluon/globals.py", line 194, in <lambda>
    self._caller = lambda f: f()
  File ".../web2py/gluon/tools.py", line 3021, in f
    return action(*a, **b)
  File ".../controllers/default.py", line 697, in generate_vis
    request.vars.json = json.dumps(list(csv_reader))
  File "/usr/local/lib/python2.7/json/__init__.py", line 243, in dumps
    return _default_encoder.encode(obj)
  File "/usr/local/lib/python2.7/json/encoder.py", line 207, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "/usr/local/lib/python2.7/json/encoder.py", line 270, in iterencode
    return _iterencode(o, 0)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xa0 in position 1: invalid start byte

Any suggestions regarding how to resolve this, or another way to get a csv file (which contains a header; using StringIO) into a json object that won't produce similar complications? Thank you.

  • Is this Python 2 or 3? Please do include the full traceback. Commented Jul 7, 2013 at 17:34
  • You don't need to use a list comprehension where a simple list() call would do: json.dumps(list(csv_reader)) would be more efficient. Commented Jul 7, 2013 at 17:35
  • Last but not least, you'll need to share how you read the file with us. What web framework is this? Commented Jul 7, 2013 at 17:36
  • Please clarify exactly which line raises the error. Commented Jul 7, 2013 at 17:40
  • python pandas offers a very convenient way of handling csv files: pandas.pydata.org/pandas-docs/stable/generated/…, if that's of any help Commented Jul 7, 2013 at 17:44

3 Answers


The csv module (under Python 2) is purely byte-based; all strings you get out of it are bytes. JSON, however, is Unicode character-based, so there is an implicit conversion when you try to write the bytes you got from CSV into JSON. Python guessed UTF-8 for this, but your CSV file wasn't UTF-8; it was probably Windows code page 1252 (Western European; similar to ISO-8859-1, but not identical).

A quick fix would be to transcode your input (file_contents = file_contents.decode('windows-1252').encode('utf-8')), but you probably don't want to rely on having guessed the right source encoding.
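As a minimal sketch of that transcoding step (treating windows-1252 as the source encoding is an assumption about this particular file, and the sample bytes are made up):

```python
# -*- coding: utf-8 -*-
# Transcode the raw CSV bytes from the (assumed) source encoding to UTF-8,
# so that json.dumps' implicit UTF-8 decode no longer chokes on 0xa0.
raw = b'caf\xe9 \xa0100'  # "cafe" with an accent, plus a cp1252 non-breaking space

fixed = raw.decode('windows-1252').encode('utf-8')

# The single offending 0xa0 byte is now the valid two-byte
# UTF-8 sequence C2 A0 (U+00A0, no-break space).
assert b'\xc2\xa0' in fixed
```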

Best would be to explicitly decode your strings at the point of reading them from CSV. Then JSON will be able to cope with them OK. Unfortunately, csv doesn't have built-in decoding (at least in this Python version), but you can do it manually:

class UnicodeDictReader(csv.DictReader):
    """DictReader that decodes every key and value to unicode (Python 2)."""
    def __init__(self, f, encoding, *args, **kwargs):
        csv.DictReader.__init__(self, f, *args, **kwargs)
        self.encoding = encoding
    def next(self):
        # Decode each byte-string key and value as the row is read
        return {
            k.decode(self.encoding): v.decode(self.encoding)
            for (k, v) in csv.DictReader.next(self).items()
        }

csv_reader = UnicodeDictReader(StringIO(file_contents), 'windows-1252')
json_output = json.dumps(list(csv_reader))

it's not known in advance what sort of encoding will come up

Well, that's more of a problem, since it's impossible to guess accurately what encoding a file is in. You would either have to specify a particular encoding, or give the user a way to signal what the encoding is, if you want to support non-ASCII characters properly.
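Incidentally, the specific byte from the traceback shows why the encoding choice matters: 0xa0 is never a valid first byte of a UTF-8 sequence, but under windows-1252 it's just a non-breaking space. A quick sketch:

```python
raw = b'\xa0'  # the byte from the traceback

# As UTF-8 this byte is invalid: 0xa0 may only appear as a
# continuation byte, never at the start of a sequence.
try:
    raw.decode('utf-8')
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False
assert not decoded_ok

# Under windows-1252 the same byte is simply U+00A0 (no-break space).
assert raw.decode('windows-1252') == u'\xa0'
```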



Try replacing your final line with

json = json.dumps([x.encode('utf-8') for x in csv_reader])

1 Comment

The specific character that's causing the issue in this particular case is '\xa0'; encode('utf-8') produces an error when encountering it.

Running unidecode over the file contents seems to do the trick:

from isounidecode import unidecode

...

file_contents = unidecode(file_stream.read())

...

Thanks, everyone!
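If pulling in an external package isn't an option, a rough stdlib-only approximation of the same idea is possible with unicodedata; the windows-1252 default and the ascii_fold helper name here are my own assumptions, and this is just as lossy as unidecode:

```python
import unicodedata

def ascii_fold(raw, encoding='windows-1252'):
    # Decode with an assumed source encoding, then decompose accented
    # characters (NFKD) and drop whatever still isn't ASCII.
    text = raw.decode(encoding)
    return unicodedata.normalize('NFKD', text).encode('ascii', 'ignore')

# Accented "cafe" plus a non-breaking space folds down to plain ASCII
# (NFKD turns U+00A0 into a regular space):
assert ascii_fold(b'caf\xe9\xa0') == b'cafe '
```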

2 Comments

That's replacing all non-ASCII characters with mangled best-fit ASCII versions - are you sure you want to do that?
You make a good point. It may not be a universal solution, but it works for my purposes, and it seems to do better than some other options at handling a multitude of cases without producing an error (as opposed to encode(), which gets tripped up by '\xa0', for example).
