
I'm guessing this is an easy fix, but I'm running into an issue where it's taking nearly an hour to save a pandas DataFrame to a CSV file using the to_csv() function. I'm using Anaconda Python 2.7.12 with pandas 0.19.1.

import os
import glob
import pandas as pd

src_files = glob.glob(os.path.join('/my/path', "*.csv.gz"))

# 1 - Takes 2 min to read 20m records from 30 files
stage = pd.DataFrame()
for file_ in sorted(src_files):
    iter_csv = pd.read_csv(file_
                           , sep=','
                           , index_col=False
                           , header=0
                           , low_memory=False
                           , iterator=True
                           , chunksize=100000
                           , compression='gzip'
                           , memory_map=True
                           , encoding='utf-8')

    df = pd.concat([chunk for chunk in iter_csv])
    stage = stage.append(df, ignore_index=True)

# 2 - Takes 55 min to write 20m records from one dataframe
stage.to_csv('output.csv'
             , sep='|'
             , header=True
             , index=False
             , chunksize=100000
             , encoding='utf-8')

del stage

I've confirmed the hardware and memory are working, but these are fairly wide tables (~ 100 columns) of mostly numeric (decimal) data.

Thank you,

  • Hardware bottleneck. Keep an eye on your disk throughput, and also check for free disk space. Commented Nov 17, 2016 at 16:46
  • As I mentioned, I did check the disk space and can copy large files to the drive at the expected speed. Also, I should have mentioned I'm writing to an SSD (Samsung 950). Commented Nov 17, 2016 at 17:47
  • Try without the chunksize kwarg... It could be a lot of things, like quoting, value conversion, etc. Try to profile it and see where it spends most of its time (a profiling sketch follows these comments). Commented Nov 17, 2016 at 18:07
  • Any update on that? I ran into a similar problem lately. Commented Apr 23, 2017 at 18:50
  • I have an SSD on PCI Express and face the same issue; hardware should not be the bottleneck in this case... Commented Jun 12, 2017 at 15:16
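
A minimal sketch of that profiling step (assuming the script from the question, with stage available as a module-level variable; the output file name is just a placeholder):

import cProfile
import pstats

# Profile only the to_csv call to see where the time goes
# (quoting, float formatting, actual disk I/O, ...).
cProfile.run(
    "stage.to_csv('output.csv', sep='|', index=False, encoding='utf-8')",
    'to_csv.prof')

# Show the 20 most expensive calls by cumulative time.
pstats.Stats('to_csv.prof').sort_stats('cumulative').print_stats(20)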

6 Answers


Adding my small insight, since the 'gzip' alternative did not work for me: try the to_hdf method. This reduced the write time significantly! (Less than a second for a 100 MB file, whereas the CSV option took between 30 and 55 seconds.)

stage.to_hdf(r'path/file.h5', key='stage', mode='w')

It is possible to save different datasets under different key names, so whatever key name we choose when saving the data has to be used when reading it back.

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_hdf.html
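
For example, a minimal round-trip sketch (the file path and key name below are just placeholders; to_hdf requires the PyTables package to be installed):

import pandas as pd

# Stand-in data for the real `stage` DataFrame.
stage = pd.DataFrame({'a': range(5), 'b': range(5)})

# Write under the key 'stage'; mode='w' overwrites any existing file.
stage.to_hdf('stage.h5', key='stage', mode='w')

# The same key must be used when reading the data back.
stage_back = pd.read_hdf('stage.h5', key='stage')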


3 Comments

This solution works for me, while the .gz solution made no difference. The .to_hdf method wrote out 1.5 GB in 13 seconds; .to_csv took too long to time, even with the changes suggested by Frane.
Yes, the .gz solution made no difference for a file size of 5 GB.
I went from 4 minutes with .to_csv, to 8 seconds with .to_hdf !!!! Thanks @amir-f !!

You are reading compressed files and writing a plain-text file, so I/O could be the bottleneck.

Writing a compressed file can speed up the write by up to 10x:

stage.to_csv('output.csv.gz'
             , sep='|'
             , header=True
             , index=False
             , chunksize=100000
             , compression='gzip'
             , encoding='utf-8')

Additionally, you could experiment with different chunk sizes and compression methods (‘bz2’, ‘xz’).
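
If you want to compare the options, a rough timing sketch (the frame below is random stand-in data; it assumes a pandas version whose to_csv supports the 'gzip', 'bz2' and 'xz' codecs):

import timeit
import numpy as np
import pandas as pd

# Random stand-in for a wide, mostly numeric table.
df = pd.DataFrame(np.random.rand(100000, 100))

for comp, ext in [(None, 'csv'), ('gzip', 'csv.gz'), ('bz2', 'csv.bz2'), ('xz', 'csv.xz')]:
    seconds = timeit.timeit(
        lambda: df.to_csv('out.' + ext, sep='|', index=False, compression=comp),
        number=1)
    print(comp, round(seconds, 1), 'sec')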

7 Comments

Frane, the solution did not work; the time taken remained the same.
@ShreeshaN what did you time? Execution of to_csv method or whole script?
Execution of to_csv
@ShreeshaN what size and time are you talking about? Look at the HDF format as an alternative. If you need text/CSV, look at stackoverflow.com/a/54617862/6646912, as mentioned in his comment on the question.
Thanks for the alternatives. I tried the HDF format; to_hdf() is super fast. It took 4 seconds to save a 600 MB file with 4 lakh (400,000) records, whereas to_csv, even after using chunks and compression, took 220 seconds. Thanks

You said "[...] of mostly numeric (decimal) data." Do you have any columns with times and/or dates?

I saved an 8 GB CSV in seconds when it had only numeric/string values, but it took 20 minutes to save a 500 MB CSV with two date columns. So, what I would recommend is to convert each date column to a string before saving it. The following command is enough:

df['Column'] = df['Column'].astype(str) 

I hope that this answer helps you.

P.S.: I understand that saving as an .hdf file solved the problem, but sometimes we do need a .csv file anyway.
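
A slightly more general sketch along the same lines (assuming the slow columns are datetime64 dtypes; it applies the astype(str) conversion to every such column and blanks out NaT, as the comments below suggest):

import pandas as pd

# Stand-in frame with one numeric and one datetime column (including a NaT).
df = pd.DataFrame({'value': [1.5, 2.5],
                   'created': pd.to_datetime(['2016-11-17', None])})

# Pre-convert every datetime column to text so to_csv does not have to
# format each timestamp while writing.
for col in df.select_dtypes(include=['datetime64[ns]']).columns:
    df[col] = df[col].astype(str).replace('NaT', '')

df.to_csv('output.csv', index=False)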

2 Comments

Also, to handle NaT values you may want df['Column'].astype(str).replace('NaT', '')
This actually helped with my problem, where I had a df of 1406x19221. My first row was strings because I forgot to read the file with them as the headers. When I made them the headers, I could save the file as CSV in seconds. Thanks!

I used to use to_csv() to write to a company network drive, which was too slow and took one hour to output a 1 GB CSV file. I just tried writing to my laptop's C: drive with the same to_csv() statement, and it only took 2 minutes to output the 1 GB CSV file.



Try either Apache's Parquet file format or the polars package, which is an alternative to the usual pandas.

I was trying to cache some data locally from my server; it has 59 million rows and 9 columns. pandas.DataFrame.to_csv simply died, so it couldn't be timed.

I put a breakpoint on the way out, saved the data as Parquet, and read it back into a polars DataFrame (the read wasn't timed, but it was roughly 5-10 seconds):

In [6]: import polars as pl
In []: pf = pl.read_parquet('path_to_my_data.parquet')

I wrote this huge DataFrame to CSV using polars:

In [8]: %timeit pf.write_csv('path_to_my_data.csv')
24.3 s ± 5.79 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

I cast the polars DataFrame to a pandas one and wrote it out using both HDF and Parquet:

In [9]: df = pf.to_pandas()
In [11]: %timeit df.to_parquet('path_to_data2.parquet')
11.7 s ± 138 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [12]: %timeit df.to_hdf('path_to_my_data.h5', key="stage", mode="w")
15.4 s ± 723 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

The Parquet file was 1.8 GB whereas the h5 file was 4.3 GB. to_parquet from pandas applies compression (snappy, gzip, or brotli), but as end users we don't need to decompress it ourselves.

Either of them can be a promising, if not superior, alternative if you need to deal with huge amounts of data and querying it back and forth is a must.
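
For completeness, a minimal self-contained sketch of the Parquet route (assuming pyarrow is installed; 'snappy' is pandas' default Parquet compression):

import numpy as np
import pandas as pd

# Random stand-in for a large numeric table.
df = pd.DataFrame(np.random.rand(1000000, 9),
                  columns=['c%d' % i for i in range(9)])

# Columnar, compressed on-disk format; far smaller and faster than CSV here.
df.to_parquet('data.parquet', compression='snappy')

# Read it back with pandas, or with polars via pl.read_parquet('data.parquet').
df2 = pd.read_parquet('data.parquet')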



Opening the file with a one-megabyte buffer makes a significant difference when saving to a shared folder.

import timeit
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': range(1000000)})
df['A'] = df.loc[:, 'A'] * 2 * np.pi

file_name = r'\\shared_folder_path\blah.csv'
# 1) default call, 2) with chunksize, 3) pre-opened handle with a 1 MB buffer,
# 4) buffer and chunksize combined
print(timeit.timeit(lambda: df.to_csv(file_name), number=1))
print(timeit.timeit(lambda: df.to_csv(file_name, chunksize=1000000), number=1))
print(timeit.timeit(lambda: df.to_csv(open(file_name, 'wb', 1000000)), number=1))
print(timeit.timeit(lambda: df.to_csv(open(file_name, 'wb', 1000000), chunksize=1000000), number=1))

Output:

59.76983120001387
61.62541880001663
6.958319600002142
9.22059939999599

Using a buffer helps while the chunksize parameter is detrimental.

When changing the file path to the local disk, we get the following output:

2.2724577999906614
2.2463568999955896
2.1668612000066787
2.2025332000048365

The impact of buffering is insignificant while the chunksize parameter is counterproductive, again.
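
If you are on Python 3 and a current pandas, the handle has to be opened in text mode rather than 'wb'; a hedged adaptation of the buffered variant (reusing the example UNC path from above):

import pandas as pd

df = pd.DataFrame({'A': range(1000000)})

# buffering sets a ~1 MB buffer on the underlying binary stream;
# newline='' avoids extra blank lines when pandas writes CSV rows.
with open(r'\\shared_folder_path\blah.csv', 'w', buffering=1000000, newline='') as fh:
    df.to_csv(fh, index=False)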

