How does the chunksize parameter in pandas.read_sql() avoid loading data into memory?

Question

I'm iterating through the results of pd.read_sql(query, engine, chunksize=10000).

I'm doing this with the SQLAlchemy engine set to echo=True, so that it prints the raw SQL commands that pandas sends to the database (Postgres).

The printouts show that pandas hits the database only once, with exactly the query I wrote and without any modifications. With this in mind, how can pandas iterate through the full output of that query in chunks without storing all the chunks in memory at once?

Please post a more complete example. One possibility is that it reads the entire dataset into memory in C, then parcels it out to Python in smaller chunks.
– jjanes, Commented Jan 2, 2020 at 19:32
I guess it is read into a DataFrame of chunksize rows with a fetch and stored there until something is done with the data; then another chunk is fetched, and so on.
– Bjarni Ragnarsson, Commented Jan 2, 2020 at 23:35

Oliver Rice · Accepted Answer · 2020-01-04 15:44:34Z


The single SQL query tells the database which results it needs to return.

Actually returning those results is handled by the communication protocol, which your driver (probably psycopg2 for Python) implements.

That protocol allows result sets to be streamed. The rows can then be chunked at the driver layer, the pandas layer, or both, without executing multiple SQL statements.
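
To illustrate the "one statement, many fetches" pattern, here is a minimal sketch using the standard-library sqlite3 module instead of Postgres, so that it runs without a server. The table `t`, the helper `iter_chunks`, and the chunk size of 10 are all hypothetical names and values chosen for illustration; this is not pandas' actual implementation, only the general DB-API idea of calling `fetchmany()` repeatedly on a cursor that has executed a single query.

```python
import sqlite3


def iter_chunks(cursor, chunksize):
    """Yield lists of rows from a cursor that has already executed a query."""
    while True:
        rows = cursor.fetchmany(chunksize)  # pull at most `chunksize` rows
        if not rows:
            break
        yield rows


# Hypothetical in-memory table standing in for the real Postgres data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER)")
conn.executemany("INSERT INTO t (id) VALUES (?)", [(i,) for i in range(25)])

cur = conn.cursor()
cur.execute("SELECT id FROM t ORDER BY id")  # the only SELECT that is issued

# Each chunk is fetched on demand; earlier chunks can be discarded.
chunk_sizes = [len(chunk) for chunk in iter_chunks(cur, 10)]
print(chunk_sizes)  # [10, 10, 5]
```

Whether rows are really held back on the server depends on the driver and cursor type. With psycopg2's default client-side cursor, the whole result set is transferred to the client when the query executes, and chunking only limits how many rows are turned into each DataFrame. If the goal is to keep the client's memory use bounded, a server-side cursor is usually needed; with SQLAlchemy this can be requested through the `stream_results=True` execution option on the connection passed to pd.read_sql().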

