
I have 30 CSV files. Each file has 200,000 rows and 10 columns.
I want to read these files and do some processing. Below is the code without multi-threading:

import os
import time

csv_dir = './csv'
csv_save_dir = './save_csv'
csv_files = os.listdir(csv_dir)

if __name__ == '__main__':
    if not os.path.exists(csv_save_dir):
        os.makedirs(csv_save_dir)
    start = time.perf_counter()

    for csv_file in csv_files:
        csv_file_path = os.path.join(csv_dir, csv_file)
        with open(csv_file_path, 'r') as f:
            lines = f.readlines()

        csv_file_save_path = os.path.join(csv_save_dir, '1_' + csv_file)
        with open(csv_file_save_path, 'w') as f:
            f.writelines(lines[:20])
        print('CSV file saved...')
    
    finish = time.perf_counter()

    print(f'Finished in {round(finish-start, 2)} second(s)')

The elapsed time of the above code is about 7 seconds. I then modified the code to use multiple threads. The code is as follows:

import os
import time
import concurrent.futures

csv_dir = './csv'
csv_save_dir = './save_csv'
csv_files = os.listdir(csv_dir)

def read_and_write_csv(csv_file):
    csv_file_path = os.path.join(csv_dir, csv_file)
    with open(csv_file_path, 'r') as f:
        lines = f.readlines()

    csv_file_save_path = os.path.join(csv_save_dir, '1_' + csv_file)
    with open(csv_file_save_path, 'w') as f:
        f.writelines(lines[:20])
    print('CSV file saved...')

if __name__ == '__main__':
    if not os.path.exists(csv_save_dir):
        os.makedirs(csv_save_dir)

    start = time.perf_counter()
    with concurrent.futures.ThreadPoolExecutor(max_workers=20) as executor:
        executor.map(read_and_write_csv, csv_files)
    finish = time.perf_counter()

    print(f'Finished in {round(finish-start, 2)} second(s)')

I expected the above code to take less time than my first version because it uses multiple threads, but the elapsed time is still about 7 seconds!

Is there a way to speed this up using multi-threading?
  • Why do you read the whole file if you only need the first twenty lines? That's the most obvious improvement to your code. Also, where does your setup spend its time: is it doing I/O or processing (CPU/RAM)? Knowing your bottlenecks is the first step to improving performance. Commented May 14, 2021 at 16:29
  • In fact, writing the first twenty lines is just for testing. As far as I know, reading and writing a CSV file is an I/O operation, so I thought multi-threading would help. However, there is no improvement. Commented May 14, 2021 at 16:45
  • Yes and no. Python can't make use of multiple CPU (cores) using multithreading. However, it can make use of multiple I/O channels using multithreading. How many storage devices are involved in your setup? I guess it's just one, and pounding that with multiple threads isn't going to make it run faster. Commented May 14, 2021 at 16:49
  • If C and D are separate drives and not just different partitions on the same drive, then yes. Commented May 14, 2021 at 19:01
  • BTW: Did you try the ProcessPoolExecutor? Maybe it's that simple. Commented May 14, 2021 at 19:07
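The first comment's suggestion (stop reading all 200,000 rows when you only need twenty) could be sketched like this; `copy_head` and the directory defaults are illustrative names, not from the original code:

```python
import itertools
import os

def copy_head(csv_file, csv_dir='./csv', save_dir='./save_csv', n=20):
    """Copy only the first n lines of csv_file to save_dir."""
    src = os.path.join(csv_dir, csv_file)
    dst = os.path.join(save_dir, '1_' + csv_file)
    with open(src) as fin, open(dst, 'w') as fout:
        # islice stops after n lines, so the rest of the file is never read
        fout.writelines(itertools.islice(fin, n))
```

Since file objects are line iterators, `islice` avoids pulling the whole file into memory the way `readlines()` does.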

1 Answer

I don't agree with the comments. Python will happily use multiple CPU cores if you have them, executing threads on separate cores.

What I think is the issue here is your test. If you added the "do some process" you mentioned to your thread workers, I think you would find the multi-threaded version faster. Right now your test merely shows that it takes about 7 seconds to read/write the CSV files, which is I/O-bound and doesn't take advantage of the CPUs.

If your "do some process" is non-trivial, I'd use multi-threading differently. Right now, you are having each thread do:

read csv file
process csv file
save csv file

This way, the threads block each other during the read and save steps, slowing things down.

For a non-trivial "process" step, I'd do this: (pseudo-code)

def process_csv(line):
    <perform your processing on a single line>

<main>:
    for csv_file in csv_files:
        <read lines from csv>

        with concurrent.futures.ThreadPoolExecutor(max_workers=8) as executor:
            executor.map(process_csv, lines)

        <write lines out to csv>

Since you're locking on read/write anyway, here at least the work-per-line is being spread across cores, and you're not trying to read all the CSVs into memory simultaneously. Pick a max_workers value appropriate for the number of cores in your system.

If "do some process" is trivial, my suggestion is probably pointless.


6 Comments

Python has its GIL, the Global Interpreter Lock. No, it can't run Python code in different threads of execution concurrently. This is a well-known problem and the reason people use multiprocessing instead. Note that I/O (network, storage) is usually executed in a way that releases the GIL before running the low-level, native code. Some extensions (I think SciPy) use similar methods to make use of multiple cores. Still, threads in Python are limited.
@Kevin Thank you for kind explanation. I'll refer to your words.
@UlrichEckhardt As far as I know, if someone does a task such as downloading images from URLs, then multi-threading helps. Is there a difference between downloading images and reading/writing CSV files?
Think about the resources involved in a task. For a download, you have various network nodes from your machine to the remote's, plus their storage for the image and a bit of CPU. When one of these resources is maxed out, running operations in parallel won't make things go faster. If doing parallel operations allows use of additional resources (another CPU core, another CDN node behind a load balancer), then you can improve throughput.
It just occurred to me that your question could have a different meaning, so perhaps ignore the above, which is still true but not directly relevant to Python. Concerning downloads compared to CSV reading/writing, yes, some blocking network operations also release the GIL to allow other Python code to run in parallel. Both things are IO (input/output).
