I am having a baffling issue with the pandas 'chunksize' parameter. I wrote a program in Python that loops through a set of values and builds a query for each one. The query results need to be written to .csv files and sent to a colleague. The results are large, so the .csv files need to be written chunk by chunk.
See the following code:
import pandas as pd

values = ['col1', 'col2', 'col3']
for col in values:
    sql_query = "SELECT " + col + ", other columns..." + " from big_table WHERE some condition..."
    # stream the result set in chunks and append each chunk to the CSV
    for chunk in pd.read_sql(sql_query, conn, chunksize=80000):
        chunk.to_csv(output_path + 'filename.csv', index=False, mode='a')
At first, I thought this program was working, as the files were written with no issues. Then I did a basic sanity check - comparing the number of rows returned by the raw query vs the number of lines in the file. They did not match.
I ran the sql_query directly against the database, but with a count(*), like so:
SELECT count(*) from big_table WHERE some condition;
result: ~1,500,000 rows
Then, I counted the lines in the file: ~1,500,020 lines
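(For the file side, I counted lines with a quick check along these lines - a minimal sketch, where output_path + 'filename.csv' is the same placeholder path as in the code above:

    # count every physical line in the generated CSV, header lines included
    with open(output_path + 'filename.csv') as f:
        print(sum(1 for _ in f))
)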
This was the same for every file; the counts were always off by 20-30 rows. I am not sure how this is possible, because the queries should be passed to the DB exactly as I have written them. Am I misunderstanding how 'chunksize' works in pandas? Is it possible that some chunks are overlapping or incomplete?
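For what it's worth, one check I could add is summing the chunk lengths as they come in, to see whether the chunks themselves cover the query result exactly (a minimal sketch, assuming the same sql_query and conn as above):

    total_rows = 0
    for chunk in pd.read_sql(sql_query, conn, chunksize=80000):
        total_rows += len(chunk)   # rows returned by the DB, not lines in the file
    print(total_rows)              # should match the count(*) result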