
I'm guessing this is an easy fix, but I'm running into an issue where it's taking nearly an hour to save a pandas DataFrame to a CSV file using the to_csv() function. I'm using Anaconda Python 2.7.12 with pandas 0.19.1.

import os
import glob
import pandas as pd

src_files = glob.glob(os.path.join('/my/path', "*.csv.gz"))

# 1 - Takes 2 min to read 20m records from 30 files
stage = pd.DataFrame()
for file_ in sorted(src_files):
    iter_csv = pd.read_csv(file_
                           , sep=','
                           , index_col=False
                           , header=0
                           , low_memory=False
                           , iterator=True
                           , chunksize=100000
                           , compression='gzip'
                           , memory_map=True
                           , encoding='utf-8')

    df = pd.concat([chunk for chunk in iter_csv])
    stage = stage.append(df, ignore_index=True)

# 2 - Takes 55 min to write 20m records from one dataframe
stage.to_csv('output.csv'
             , sep='|'
             , header=True
             , index=False
             , chunksize=100000
             , encoding='utf-8')

del stage

I've confirmed the hardware and memory are working, but these are fairly wide tables (~ 100 columns) of mostly numeric (decimal) data.

Thank you,

  • Hardware bottleneck. Keep an eye on your disk throughput, and also check for free disk space. Commented Nov 17, 2016 at 16:46
  • As I mentioned, I did check the disk space and can copy large files to the drive at the expected speed. Also, I should have mentioned I'm writing to an SSD (Samsung 950). Commented Nov 17, 2016 at 17:47
  • Try without the chunksize kwarg... It could be a lot of things, like quoting, value conversion, etc. Try to profile it and see where it spends most of its time (a profiling sketch follows these comments). Commented Nov 17, 2016 at 18:07
  • Any update on that? I ran into a similar problem lately. Commented Apr 23, 2017 at 18:50
  • I have an SSD on PCI Express and face the same issue; hardware should not be the bottleneck in this case... Commented Jun 12, 2017 at 15:16
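
A minimal sketch of that profiling step (assuming the script from the question, with stage available as a module-level variable; the output file name is just a placeholder):

import cProfile
import pstats

# Profile only the to_csv call to see where the time goes
# (quoting, float formatting, actual disk I/O, ...).
cProfile.run(
    "stage.to_csv('output.csv', sep='|', index=False, encoding='utf-8')",
    'to_csv.prof')

# Show the 20 most expensive calls by cumulative time.
pstats.Stats('to_csv.prof').sort_stats('cumulative').print_stats(20)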

6 Answers


Adding my small insight, since the 'gzip' alternative did not work for me: try the to_hdf method. This reduced the write time significantly! (Less than a second for a 100 MB file, whereas the CSV option took between 30 and 55 seconds.)

stage.to_hdf(r'path/file.h5', key='stage', mode='w')

It is possible to save different datasets under different key names, so whatever key name we choose when saving the data has to be used when reading it back.

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_hdf.html
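
For example, a minimal round-trip sketch (the file path and key name below are just placeholders; to_hdf requires the PyTables package to be installed):

import pandas as pd

# Stand-in data for the real `stage` DataFrame.
stage = pd.DataFrame({'a': range(5), 'b': range(5)})

# Write under the key 'stage'; mode='w' overwrites any existing file.
stage.to_hdf('stage.h5', key='stage', mode='w')

# The same key must be used when reading the data back.
stage_back = pd.read_hdf('stage.h5', key='stage')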


3 Comments

This solution works for me, while the .gz solution made no difference. The .to_hdf method wrote out 1.5 GB in 13 seconds; .to_csv took too long to time, even with the changes suggested by Frane.
Yes, the .gz solution made no difference for a file size of 5 GB.
I went from 4 minutes with .to_csv, to 8 seconds with .to_hdf !!!! Thanks @amir-f !!

You are reading compressed files and writing a plain-text file, so I/O could be the bottleneck.

Writing a compressed file can speed up the write by up to 10x:

stage.to_csv('output.csv.gz'
             , sep='|'
             , header=True
             , index=False
             , chunksize=100000
             , compression='gzip'
             , encoding='utf-8')

Additionally, you could experiment with different chunk sizes and compression methods (‘bz2’, ‘xz’).
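
If you want to compare the options, a rough timing sketch (the frame below is random stand-in data; it assumes a pandas version whose to_csv supports the 'gzip', 'bz2' and 'xz' codecs):

import timeit
import numpy as np
import pandas as pd

# Random stand-in for a wide, mostly numeric table.
df = pd.DataFrame(np.random.rand(100000, 100))

for comp, ext in [(None, 'csv'), ('gzip', 'csv.gz'), ('bz2', 'csv.bz2'), ('xz', 'csv.xz')]:
    seconds = timeit.timeit(
        lambda: df.to_csv('out.' + ext, sep='|', index=False, compression=comp),
        number=1)
    print(comp, round(seconds, 1), 'sec')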

7 Comments

Frane, the solution did not work; the time taken remained the same.
@ShreeshaN what did you time? Execution of to_csv method or whole script?
Execution of to_csv
@ShreeshaN what size and time are you talking about? Look at the HDF format as an alternative. If you need text/CSV, look at stackoverflow.com/a/54617862/6646912, as mentioned in his comment on the question.
Thanks for the alternatives. I tried the HDF format; to_hdf() is super fast. It took 4 seconds to save a 600 MB file with 4 lakh (400,000) records, whereas to_csv, even after using chunks and compression, took 220 seconds. Thanks

You said "[...] of mostly numeric (decimal) data." Do you have any columns with times and/or dates?

I saved an 8 GB CSV in seconds when it had only numeric/string values, but it took 20 minutes to save a 500 MB CSV with two date columns. So, what I would recommend is to convert each date column to a string before saving it. The following command is enough:

df['Column'] = df['Column'].astype(str) 

I hope that this answer helps you.

P.S.: I understand that saving as an .hdf file solved the problem, but sometimes we do need a .csv file anyway.
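
A slightly more general sketch along the same lines (assuming the slow columns are datetime64 dtypes; it applies the astype(str) conversion to every such column and blanks out NaT, as the comments below suggest):

import pandas as pd

# Stand-in frame with one numeric and one datetime column (including a NaT).
df = pd.DataFrame({'value': [1.5, 2.5],
                   'created': pd.to_datetime(['2016-11-17', None])})

# Pre-convert every datetime column to text so to_csv does not have to
# format each timestamp while writing.
for col in df.select_dtypes(include=['datetime64[ns]']).columns:
    df[col] = df[col].astype(str).replace('NaT', '')

df.to_csv('output.csv', index=False)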

2 Comments

Also, to handle NaT values you may want df['Column'].astype(str).replace('NaT', '')
This actually helped with my problem, where I had a df of 1406x19221. My first row was strings because I forgot to read the file with them as the headers. When I made them the headers, I could save the file as CSV in seconds. Thanks!

I used to use to_csv() to write to a company network drive, which was too slow and took one hour to output a 1 GB CSV file. I just tried writing to my laptop's C: drive with the same to_csv() statement, and it only took 2 minutes to output the 1 GB CSV file.



Try either Apache's Parquet file format or the polars package, which is an alternative to the usual pandas.

I was trying to cache some data locally from my server; it has 59 million rows and 9 columns. pandas.DataFrame.to_csv simply died, so it couldn't be timed.

I put a breakpoint on the way out, saved the data as Parquet, and read it back into a polars DataFrame (the read wasn't timed, but it was roughly 5-10 seconds):

In [6]: import polars as pl
In []: pf = pl.read_parquet('path_to_my_data.parquet')

I wrote this huge DataFrame to CSV using polars:

In [8]: %timeit pf.write_csv('path_to_my_data.csv')
24.3 s ± 5.79 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

I cast the polars DataFrame to a pandas one and wrote it out using both HDF and Parquet:

In [9]: df = pf.to_pandas()
In [11]: %timeit df.to_parquet('path_to_data2.parquet')
11.7 s ± 138 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [12]: %timeit df.to_hdf('path_to_my_data.h5', key="stage", mode="w")
15.4 s ± 723 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

The Parquet file was 1.8 GB whereas the h5 file was 4.3 GB. to_parquet from pandas applies compression (snappy, gzip, or brotli), but as end users we don't need to decompress it ourselves.

Either of them can be a promising, if not superior, alternative if you need to deal with huge amounts of data and querying it back and forth is a must.
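
For completeness, a minimal self-contained sketch of the Parquet route (assuming pyarrow is installed; 'snappy' is pandas' default Parquet compression):

import numpy as np
import pandas as pd

# Random stand-in for a large numeric table.
df = pd.DataFrame(np.random.rand(1000000, 9),
                  columns=['c%d' % i for i in range(9)])

# Columnar, compressed on-disk format; far smaller and faster than CSV here.
df.to_parquet('data.parquet', compression='snappy')

# Read it back with pandas, or with polars via pl.read_parquet('data.parquet').
df2 = pd.read_parquet('data.parquet')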



Opening the file with a one-megabyte buffer makes a significant difference when saving to a shared folder.

import timeit
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': range(1000000)})
df['A'] = df.loc[:, 'A'] * 2 * np.pi

file_name = r'\\shared_folder_path\blah.csv'
# 1) default call, 2) with chunksize, 3) pre-opened handle with a 1 MB buffer,
# 4) buffer and chunksize combined
print(timeit.timeit(lambda: df.to_csv(file_name), number=1))
print(timeit.timeit(lambda: df.to_csv(file_name, chunksize=1000000), number=1))
print(timeit.timeit(lambda: df.to_csv(open(file_name, 'wb', 1000000)), number=1))
print(timeit.timeit(lambda: df.to_csv(open(file_name, 'wb', 1000000), chunksize=1000000), number=1))

Output:

59.76983120001387
61.62541880001663
6.958319600002142
9.22059939999599

Using a buffer helps while the chunksize parameter is detrimental.

When changing the file path to the local disk, we get the following output:

2.2724577999906614
2.2463568999955896
2.1668612000066787
2.2025332000048365

The impact of buffering is insignificant while the chunksize parameter is counterproductive, again.
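
If you are on Python 3 and a current pandas, the handle has to be opened in text mode rather than 'wb'; a hedged adaptation of the buffered variant (reusing the example UNC path from above):

import pandas as pd

df = pd.DataFrame({'A': range(1000000)})

# buffering sets a ~1 MB buffer on the underlying binary stream;
# newline='' avoids extra blank lines when pandas writes CSV rows.
with open(r'\\shared_folder_path\blah.csv', 'w', buffering=1000000, newline='') as fh:
    df.to_csv(fh, index=False)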

