Incorrect number of rows when using pandas chunksize and postgreSQL

Question

I am having a baffling issue with pandas 'chunksize' parameter. I wrote a program in python that loops through a set of values and creates queries based on them. This data needs to be written to .csv files and sent to a colleague. The result of these queries is large, so the .csv need to be written chunk by chunk.

See to the following code:

values = [col1, col2, col3]

for col in values:
   sql_query = "SELECT " + col + " other columns..."  + " from big_table WHERE some condition..."

for chunk in pd.read_sql(sql_query, conn, chunksize = 80000):
    chunk.to_csv(output_path + 'filename.csv', index=False, mode = 'a')

At first, I thought this program was working as the files were written with no issue. I decided to do a basic sanity check - comparing the number of rows in the raw query vs the number of lines in the file. They did not match.

I entered the sql_query, but using a count(*), directly into the database like so:

SELECT  count(*) from big_table WHERE some condition;

result: ~1,500,000 rows

Then, I counted the rows in the file: ~1,500,020 rows

This was the same for every file. It seems the values were off by 20 - 30 rows. I am not sure how this is possible, because the queries should be being passed to the DB exactly as I have written them. Am I misunderstanding how 'chunksize' works in pandas? Is there a possible some chunks are overlapping or incomplete?

Wrapped lines in the CSV file? Try writing the file back to an empty table and see what you get? — Adrian Klaver
– Adrian Klaver, Commented Jan 9, 2022 at 1:19

jjanes · Accepted Answer · 2022-01-09 03:11:29Z

2

Each chunk gets its own header line. You would need to set header=False for all chunks but the first. Or for all chunks, whatever you wish.

Better yet, just use python directly and bypass pandas, then you won't need to do it in chunks in the first place, and it should be much faster.

answered Jan 9, 2022 at 3:11

jjanes

44.9k5 gold badges39 silver badges48 bronze badges

Sign up to request clarification or add additional context in comments.

Collectives™ on Stack Overflow

Incorrect number of rows when using pandas chunksize and postgreSQL

1 Answer 1

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

Comments

Your Answer

Sign up or log in

Post as a guest

Related