
I have 30 CSV files. Each file has 200,000 rows and 10 columns.
I want to read these files and do some processing. Below is the code without multi-threading:

import os
import time

csv_dir = './csv'
csv_save_dir = './save_csv'
csv_files = os.listdir(csv_dir)

if __name__ == '__main__':
    if not os.path.exists(csv_save_dir):
        os.makedirs(csv_save_dir)
    start = time.perf_counter()

    for csv_file in csv_files:
        csv_file_path = os.path.join(csv_dir, csv_file)
        with open(csv_file_path, 'r') as f:
            lines = f.readlines()

        csv_file_save_path = os.path.join(csv_save_dir, '1_' + csv_file)
        with open(csv_file_save_path, 'w') as f:
            f.writelines(lines[:20])
        print('CSV file saved...')
    
    finish = time.perf_counter()

    print(f'Finished in {round(finish-start, 2)} second(s)')

The elapsed time of the above code is about 7 seconds. I then modified the code to use multiple threads. The code is as follows:

import os
import time
import concurrent.futures

csv_dir = './csv'
csv_save_dir = './save_csv'
csv_files = os.listdir(csv_dir)

def read_and_write_csv(csv_file):
    csv_file_path = os.path.join(csv_dir, csv_file)
    with open(csv_file_path, 'r') as f:
        lines = f.readlines()

    csv_file_save_path = os.path.join(csv_save_dir, '1_' + csv_file)
    with open(csv_file_save_path, 'w') as f:
        f.writelines(lines[:20])
    print('CSV file saved...')

if __name__ == '__main__':
    if not os.path.exists(csv_save_dir):
        os.makedirs(csv_save_dir)

    start = time.perf_counter()
    with concurrent.futures.ThreadPoolExecutor(max_workers=20) as executor:
        executor.map(read_and_write_csv, csv_files)
    finish = time.perf_counter()

    print(f'Finished in {round(finish-start, 2)} second(s)')

I expected the above code to take less time than my first version because it uses multiple threads, but the elapsed time is still about 7 seconds!

Is there a way to speed this up using multi-threading?
  • Why do you read the whole file if you only need the first twenty lines? That's the most obvious improvement to your code. Also, where does your setup spend its time: is it doing I/O or processing (CPU/RAM)? Knowing your bottlenecks is the first step to improving performance. Commented May 14, 2021 at 16:29
  • In fact, writing the first twenty lines is just for testing. As far as I know, reading and writing a CSV file is an I/O operation, so I thought multi-threading would help. However, there is no improvement. Commented May 14, 2021 at 16:45
  • Yes and no. Python can't make use of multiple CPU (cores) using multithreading. However, it can make use of multiple I/O channels using multithreading. How many storage devices are involved in your setup? I guess it's just one, and pounding that with multiple threads isn't going to make it run faster. Commented May 14, 2021 at 16:49
  • If C and D are separate drives and not just different partitions on the same drive, then yes. Commented May 14, 2021 at 19:01
  • BTW: Did you try the ProcessPoolExecutor? Maybe it's that simple. Commented May 14, 2021 at 19:07
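The first comment's suggestion (stop reading all 200,000 rows when you only need twenty) could be sketched like this; `copy_head` and the directory defaults are illustrative names, not from the original code:

```python
import itertools
import os

def copy_head(csv_file, csv_dir='./csv', save_dir='./save_csv', n=20):
    """Copy only the first n lines of csv_file to save_dir."""
    src = os.path.join(csv_dir, csv_file)
    dst = os.path.join(save_dir, '1_' + csv_file)
    with open(src) as fin, open(dst, 'w') as fout:
        # islice stops after n lines, so the rest of the file is never read
        fout.writelines(itertools.islice(fin, n))
```

Since file objects are line iterators, `islice` avoids pulling the whole file into memory the way `readlines()` does.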

1 Answer

I don't agree with the comments. Python will happily use multiple CPU cores if you have them, executing threads on separate cores.

What I think is the issue here is your test. If you added the "do some process" you mentioned to your thread workers, I think you would find the multi-threaded version faster. Right now your test merely shows that it takes about 7 seconds to read/write the CSV files, which is I/O-bound and doesn't take advantage of the CPUs.

If your "do some process" is non-trivial, I'd use multi-threading differently. Right now, you are having each thread do:

read csv file
process csv file
save csv file

This way, the threads block each other during the read and save steps, slowing things down.

For a non-trivial "process" step, I'd do this: (pseudo-code)

def process_csv(line):
    <perform your processing on a single line>

<main>:
    for csv_file in csv_files:
        <read lines from csv>

        with concurrent.futures.ThreadPoolExecutor(max_workers=8) as executor:
            executor.map(process_csv, lines)

        <write lines out to csv>

Since you're locking on read/write anyway, here at least the work-per-line is being spread across cores, and you're not trying to read all the CSVs into memory simultaneously. Pick a max_workers value appropriate for the number of cores in your system.

If "do some process" is trivial, my suggestion is probably pointless.


6 Comments

Python has its GIL, the Global Interpreter Lock. No, it can't run Python code in different threads of execution concurrently. This is a well-known problem and the reason people use multiprocessing instead. Note that I/O (network, storage) is usually executed in a way that releases the GIL before running the low-level, native code. Some extensions (I think SciPy) use similar methods to make use of multiple cores. Still, threads in Python are limited.
@Kevin Thank you for kind explanation. I'll refer to your words.
@UlrichEckhardt As far as I know, if someone does a task such as downloading images from URLs, then multi-threading helps. Is there a difference between downloading images and reading/writing CSV files?
Think about the resources involved in a task. For a download, you have various network nodes from your machine to the remote's, plus their storage for the image and a bit of CPU. When one of these resources is maxed out, running operations in parallel won't make things go faster. If doing parallel operations allows use of additional resources (another CPU core, another CDN node behind a load balancer), then you can improve throughput.
It just occurred to me that your question could have a different meaning, so perhaps ignore the above, which is still true but not directly relevant to Python. Concerning downloads compared to CSV reading/writing, yes, some blocking network operations also release the GIL to allow other Python code to run in parallel. Both things are IO (input/output).
