I have a list containing multiple dataframes. These dataframes can be quite large and take some time to write to CSV files, so I tried to write them concurrently using multithreading to reduce the total time. Why does the multithreaded version take more time than the sequential version? Is writing a dataframe to CSV with pandas not an I/O-bound operation, or am I not implementing it correctly?
Multithreading:
import concurrent.futures
import time

list_of_dfs = [df_a, df_b, df_c]

start = time.time()
with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
    results = executor.map(
        lambda i: list_of_dfs[i].to_csv('Rough/' + str(i) + '.csv', index=False),
        range(len(list_of_dfs)),
    )
print(time.time() - start)
>>> 18.202364921569824
Sequential:
start = time.time()
for i in range(len(list_of_dfs)):
    list_of_dfs[i].to_csv('Rough/' + str(i) + '.csv', index=False)
print(time.time() - start)
>>> 13.783314228057861
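For reference, here is a self-contained version of the comparison that anyone can run. The dummy dataframes, their sizes, and the temporary output directory are my stand-ins for the real data (the originals are presumably much larger, so absolute timings will differ):

```python
import concurrent.futures
import os
import tempfile
import time

import numpy as np
import pandas as pd

# Dummy dataframes standing in for the real ones (an assumption:
# the actual frames are larger, so timings will not match the question).
list_of_dfs = [pd.DataFrame(np.random.rand(10_000, 10)) for _ in range(3)]

out_dir = tempfile.mkdtemp()  # stands in for the 'Rough/' directory

def write_df(i):
    # Each task writes one dataframe to its own CSV file.
    list_of_dfs[i].to_csv(os.path.join(out_dir, f"{i}.csv"), index=False)

# Multithreaded version.
start = time.time()
with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
    list(executor.map(write_df, range(len(list_of_dfs))))
threaded = time.time() - start

# Sequential version (separate filenames so the two runs do not overlap).
start = time.time()
for i in range(len(list_of_dfs)):
    list_of_dfs[i].to_csv(os.path.join(out_dir, f"seq_{i}.csv"), index=False)
sequential = time.time() - start

print(f"threaded:   {threaded:.3f}s")
print(f"sequential: {sequential:.3f}s")
```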