I'm iterating through the results of pd.read_sql(query, engine, chunksize=10000)

The SQLAlchemy engine is created with echo=True, so it prints the raw SQL statements that Pandas sends to the database (Postgres).

The printouts show that Pandas hits the db only once with exactly the query I wrote, without any modifications. With this in mind, how is it possible for Pandas to iterate through the full output of that query in chunks, while also not storing all chunks in memory at once?

  • Please post a more complete example. One possibility is that it reads the entire dataset into memory in C, then parcels it out to Python in smaller chunks. Commented Jan 2, 2020 at 19:32
  • I guess each chunk of chunksize rows is fetched into a DataFrame and held there until something is done with the data, then the next chunk is fetched, and so on. Commented Jan 2, 2020 at 23:35

1 Answer

The single SQL query makes the database aware of which results it needs to return.

Actually returning the results is the job of the communication protocol implemented by your driver (probably psycopg2 for Python).

That protocol allows result sets to be streamed. The rows can then be chunked at the driver layer, the pandas layer, or both, without executing any further SQL statements.
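To make the "one query, many chunks" idea concrete, here is a minimal sketch using only the standard-library sqlite3 driver instead of Postgres, so that it runs anywhere. The table `t` and the helper `iter_chunks` are made up for illustration; this is not pandas' actual implementation, only the general DB-API pattern of executing once and then calling `fetchmany` repeatedly on the same cursor.

```python
import sqlite3


def iter_chunks(cursor, chunksize):
    """Yield lists of rows from an already-executed cursor, chunksize rows at a time."""
    while True:
        rows = cursor.fetchmany(chunksize)
        if not rows:
            break
        yield rows


# Hypothetical in-memory table with 25 rows, standing in for the real Postgres data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, val TEXT)")
conn.executemany("INSERT INTO t (val) VALUES (?)", [(f"row{i}",) for i in range(25)])
conn.commit()

# Record every SQL statement the connection runs from here on (similar in spirit to echo=True).
statements = []
conn.set_trace_callback(statements.append)

# The query is executed exactly once; the chunks are pulled from the same open cursor.
cursor = conn.execute("SELECT id, val FROM t ORDER BY id")
chunk_sizes = [len(chunk) for chunk in iter_chunks(cursor, 10)]

conn.set_trace_callback(None)
conn.close()

print(chunk_sizes)
print(statements)
```

Only one SELECT is recorded, yet the rows arrive in three batches. One caveat for Postgres: psycopg2's default client-side cursor still transfers the whole result set to the client when the query executes, so chunksize by itself only bounds the size of each DataFrame. To have the server hand over rows incrementally, you can ask SQLAlchemy for a server-side cursor with the stream_results=True execution option, for example engine.connect().execution_options(stream_results=True), and pass that connection to pd.read_sql.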
