
I have a CSV file with ~3 million records that I want to migrate to SQL Server from my laptop (4 GB RAM).

pandas successfully reads the file into a DataFrame (pd.read_csv()), but when I try to migrate it (.to_sql()) I receive a MemoryError:

---------------------------------------------------------------------------
MemoryError                               Traceback (most recent call last)
<ipython-input-12-94c146c2b7b7> in <module>()
----> 1 csv.to_sql(name='stats', con=engine, if_exists='append')

C:\Python27\lib\site-packages\pandas\core\generic.pyc in to_sql(self, name, con, flavor, schema, if_exists, index, index_label, chunksize, dtype)
    964             self, name, con, flavor=flavor, schema=schema, if_exists=if_exists,
    965             index=index, index_label=index_label, chunksize=chunksize,
--> 966             dtype=dtype)
    967 
    968     def to_pickle(self, path):

C:\Python27\lib\site-packages\pandas\io\sql.pyc in to_sql(frame, name, con, flavor, schema, if_exists, index, index_label, chunksize, dtype)
    536     pandas_sql.to_sql(frame, name, if_exists=if_exists, index=index,
    537                       index_label=index_label, schema=schema,
--> 538                       chunksize=chunksize, dtype=dtype)
    539 
    540 

C:\Python27\lib\site-packages\pandas\io\sql.pyc in to_sql(self, frame, name, if_exists, index, index_label, schema, chunksize, dtype)
   1170                          schema=schema, dtype=dtype)
   1171         table.create()
-> 1172         table.insert(chunksize)
   1173         # check for potentially case sensitivity issues (GH7815)
   1174         if name not in self.engine.table_names(schema=schema or self.meta.schema):

C:\Python27\lib\site-packages\pandas\io\sql.pyc in insert(self, chunksize)
    715 
    716                 chunk_iter = zip(*[arr[start_i:end_i] for arr in data_list])
--> 717                 self._execute_insert(conn, keys, chunk_iter)
    718 
    719     def _query_iterator(self, result, chunksize, columns, coerce_float=True,

C:\Python27\lib\site-packages\pandas\io\sql.pyc in _execute_insert(self, conn, keys, data_iter)
    689 
    690     def _execute_insert(self, conn, keys, data_iter):
--> 691         data = [dict((k, v) for k, v in zip(keys, row)) for row in data_iter]
    692         conn.execute(self.insert_statement(), data)
    693 

MemoryError:

Is there some other way that would let me complete the migration successfully?

  • You could chunk it: read 50k rows at a time, write to SQL, and repeat. There is a chunksize param for read_csv. Commented Jan 16, 2015 at 15:34
  • Actually, there is a chunksize param for to_sql. By default it is None, which means all the rows are written at once. Could you try setting it to some value and see how you go? Commented Jan 16, 2015 at 15:49

1 Answer


I think you have two approaches:

  1. Read the CSV in chunks, write each chunk to the SQL DB, and repeat
  2. Or read it all at once and write to the DB in chunks

So for read_csv there is a chunksize param.
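
A minimal sketch of that first approach, assuming a SQLAlchemy engine pointed at your SQL Server instance (the connection string and the stats.csv filename below are placeholders):

    import pandas as pd
    from sqlalchemy import create_engine

    # Hypothetical connection string -- replace user, password, server, database
    # and driver with whatever your SQL Server setup actually uses.
    engine = create_engine(
        "mssql+pyodbc://user:password@server/database?driver=SQL+Server"
    )

    # Stream the CSV in 50,000-row chunks so only one chunk is held in memory
    # at a time, appending each chunk to the target table as it is read.
    for chunk in pd.read_csv("stats.csv", chunksize=50000):
        chunk.to_sql(name="stats", con=engine, if_exists="append", index=False)

Each iteration only materialises 50,000 rows, so the memory footprint stays small regardless of the total file size.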

Equally, there is also a chunksize param for to_sql.
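
And a sketch of the second approach, reusing the hypothetical engine from above: read the whole file once (which already works for you), then let to_sql batch the inserts instead of building one giant statement:

    # Full read, but write in batches of 1,000 rows per INSERT.
    df = pd.read_csv("stats.csv")
    df.to_sql(name="stats", con=engine, if_exists="append", index=False, chunksize=1000)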


2 Comments

Excellent, it works: for data in pd.read_csv(..., chunksize=1000): data.to_sql(...). Just to mention, I didn't have any luck with .to_sql(chunksize=...).
@theta Glad one of the approaches worked for you. I don't use SQL, so I don't know why that one didn't work; it is supposed to.
