4

I am using the Google Cloud Storage Client Library.

I am trying to open and process a CSV file (that was already uploaded to a bucket) using code like:

filename = '/<my_bucket>/data.csv'
with gcs.open(filename, 'r') as gcs_file:
    csv_reader = csv.reader(gcs_file, delimiter=',', quotechar='"')

I get the error "argument 1 must be an iterator" in response to the first argument to csv.reader (i.e. the gcs_file). Apparently the gcs_file doesn't support the iterator .next method.

Any ideas on how to proceed? Do I need to wrap the gcs_file and create an iterator on it or is there an easier way?

2 Answers

3

I think it's better to write your own wrapper/iterator designed for csv.reader. If gcs_file were to support the iterator protocol, it is not clear what next() should return to suit every consumer.

According to the csv.reader documentation:

Return a reader object which will iterate over lines in the given csvfile. csvfile can be any object which supports the iterator protocol and returns a string each time its next() method is called — file objects and list objects are both suitable. If csvfile is a file object, it must be opened with the ‘b’ flag on platforms where that makes a difference.

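csv.reader treats each string returned by next() as one line of input, so the wrapper has to return complete lines rather than arbitrary chunks of bytes. It can still read the underlying file a chunk at a time and buffer any partial line. You can have a wrapper like this (not tested):

class CsvIterator(object):
    def __init__(self, gcs_file, chunk_size):
        self.gcs_file = gcs_file
        self.chunk_size = chunk_size
        self.buffer = ''  # data read but not yet returned

    def __iter__(self):
        return self

    def next(self):
        # Read chunks until the buffer holds a full line or the file ends.
        while '\n' not in self.buffer:
            chunk = self.gcs_file.read(size=self.chunk_size)
            if not chunk:
                break
            self.buffer += chunk
        if not self.buffer:
            raise StopIteration()
        line, sep, self.buffer = self.buffer.partition('\n')
        return line + sep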

The key is to read a chunk at a time so that a large file doesn't blow up memory or run into a urlfetch timeout.
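
As a runnable illustration of chunked reading, here is a minimal sketch written as a generator for Python 3. Because csv.reader treats each string it receives as one line, the sketch buffers the chunks and yields complete lines. io.StringIO is only a stand-in for gcs_file, and the names iter_lines and fake_gcs_file are hypothetical, not part of the GCS client library.

```python
import csv
import io


def iter_lines(file_obj, chunk_size):
    """Yield complete lines from file_obj, reading chunk_size characters at a time."""
    buffer = ''
    while True:
        chunk = file_obj.read(chunk_size)
        if not chunk:
            break
        buffer += chunk
        # Hand out every complete line currently held in the buffer.
        while '\n' in buffer:
            line, sep, buffer = buffer.partition('\n')
            yield line + sep
    if buffer:
        # Last line of a file that does not end with a newline.
        yield buffer


# Assumption: io.StringIO stands in for gcs_file so the sketch runs anywhere.
fake_gcs_file = io.StringIO('a,b,c\n1,2,3\n"x, y",4,5')

# A deliberately tiny chunk size shows that rows survive chunk boundaries.
rows = list(csv.reader(iter_lines(fake_gcs_file, chunk_size=4)))
print(rows)
```

Only one chunk plus the current partial line is held in memory at a time, which is the point of the chunked approach.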

Or, even simpler, use the built-in iter:

csv.reader(iter(gcs_file.readline, ''))
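
The two-argument form iter(callable, sentinel) keeps calling the callable until it returns the sentinel, which here is the empty string that readline() returns at end of file. The sketch below shows this on Python 3 with a hypothetical ReadlineOnly class that stands in for gcs_file: it has readline() but is not iterable. If readline() returned bytes instead of text, the sentinel would need to be b''.

```python
import csv


class ReadlineOnly(object):
    """Hypothetical stand-in for gcs_file: offers readline() but no iteration."""

    def __init__(self, text):
        self._lines = text.splitlines(True)  # keep the line endings

    def readline(self):
        # Return the next line, or '' once everything has been consumed.
        return self._lines.pop(0) if self._lines else ''


# Passing the object directly fails because it is not an iterator.
try:
    csv.reader(ReadlineOnly('name,qty\n'))
    direct_call_failed = False
except TypeError:
    direct_call_failed = True

# Wrapping readline with iter() produces an iterator that stops at ''.
fake_gcs_file = ReadlineOnly('name,qty\n"Smith, J",3\n')
rows = list(csv.reader(iter(fake_gcs_file.readline, ''),
                       delimiter=',', quotechar='"'))
print(direct_call_failed, rows)
```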

2 Comments

I am using csv_reader_reader = csv.reader(iter(gcs_file.readline, ''), delimiter=',', quotechar='"') and it works well.
Fixed. Note some changes before it requires 183 SDK. code.google.com/p/appengine-gcs-client/source/list
2

Try this:

from StringIO import StringIO
filename = '/<my_bucket>/data.csv'
with gcs.open(filename, 'r') as gcs_file:
    csv_reader = csv.reader(StringIO(gcs_file.read()), delimiter=',',
                            quotechar='"')

This isn't ideal though, because gcs_file.read() loads the whole file into memory. I've filed a feature request to have GCS files support iteration.
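
For readers on Python 3, where the StringIO module no longer exists, a rough equivalent of the read-everything approach might look like the sketch below. It assumes the file object's read() returns UTF-8 encoded bytes. io.BytesIO is only a stand-in for gcs_file, and fake_gcs_file is a hypothetical name.

```python
import csv
import io

# Assumption: stand-in for gcs_file, an object whose read() returns bytes.
fake_gcs_file = io.BytesIO(b'name,qty\n"Smith, J",3\n')

# Read the whole file into memory, decode it (UTF-8 assumed),
# and wrap the text in an in-memory buffer that csv.reader can iterate.
text = fake_gcs_file.read().decode('utf-8')
csv_reader = csv.reader(io.StringIO(text), delimiter=',', quotechar='"')
rows = list(csv_reader)
print(rows)
```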

2 Comments

Thank you for filing the feature request. I think using the built-in iter function works well. Thank you also for the StringIO idea.
I suggest using cStringIO, which is faster.
