
I am learning multiprocessing and threading in Python to process and create a large number of files; the workflow is shown in the linked diagram.

Each output file depends on the analysis of all input files.

Single-process execution of the program takes quite a long time, so I tried the following code:

(a) multiprocessing

import time
from multiprocessing import Pool, cpu_count

start = time.time()
process_count = cpu_count()
p = Pool(process_count)
for i in range(process_count):
    # each task reads all input files and writes output file i
    p.apply_async(my_read_process_and_write_func, args=(i,))

p.close()
p.join()
end = time.time()

(b) threading

import time
import threading
from multiprocessing import cpu_count

start = time.time()
thread_count = cpu_count()
thread_list = []

for i in range(thread_count):
    t = threading.Thread(target=my_read_process_and_write_func, args=(i,))
    thread_list.append(t)

# start all threads, then wait for all of them to finish
for t in thread_list:
    t.start()

for t in thread_list:
    t.join()

end = time.time()

I am running this code using Python 3.6 on a Windows PC with 8 cores. However, the multiprocessing method takes about the same time as the single-process method, and the threading method takes about 75% of the single-process time.

My questions are:

Is my code correct?

Is there a better way/code to improve the efficiency? Thanks!

3 Answers


Your processing is I/O bound, not CPU bound, so having multiple processes helps little: each Python process sits waiting for input or output while the CPU does nothing. Increasing the Pool size in multiprocessing should improve performance.
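As a rough sketch of that suggestion (assuming the my_read_process_and_write_func from the question, and an oversubscription factor of 4 that is only a starting point for tuning), an I/O-bound pool can be made larger than the core count so that workers blocked on disk reads leave the CPU free for others:

import time
from multiprocessing import Pool, cpu_count

# my_read_process_and_write_func is the worker from the question,
# assumed to be defined or imported here.

if __name__ == "__main__":  # guard is required on Windows
    start = time.time()
    # Oversubscribe the pool: while some workers block on disk I/O,
    # others can run. The factor of 4 is an assumption to tune.
    with Pool(processes=cpu_count() * 4) as pool:
        pool.map(my_read_process_and_write_func, range(8))  # 8 output files
    end = time.time()
    print(f"elapsed: {end - start:.1f} s")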


2 Comments

Thanks Tarik, your answer helped a lot!
Answer accepted, I have provided my current solution below. Please enlighten me if there are better methods, thanks.

Following Tarik's answer, since my processing is I/O bound, I made several copies of the input files, so that each process reads and processes a different copy. Now my code runs 8 times faster. A minimal sketch of the approach is below.
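(In this sketch the file names and the copy step are assumptions; the worker is the one from the question.)

import shutil
from multiprocessing import Pool, cpu_count

# my_read_process_and_write_func is the question's worker, assumed to
# open "index_{i}.dat" for job i. All file names here are hypothetical.

def make_copies(src="index.dat", n=8):
    # One private copy of the input per worker, so processes do not
    # contend on the same file handle.
    for i in range(n):
        shutil.copy(src, f"index_{i}.dat")

if __name__ == "__main__":
    n = cpu_count()
    make_copies(src="index.dat", n=n)
    with Pool(n) as pool:
        pool.map(my_read_process_and_write_func, range(n))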

4 Comments

You mean you have identical copies of the input files? Are you opening the input files in read only mode?
Yes. I tried to increase the Pool size in multiprocessing, but there was no big difference. Since my task is I/O bound, I copied the input files, and each process reads its own copy of the input files to generate different output files.
You got me curious with "I made several copies of input files", because it should not have any effect: once read by the operating system, the file blocks are cached in system memory, unless of course the input file is so large that it cannot be cached. In the case where multiple processes are processing the same large input file, I would have a main process read the file in sequence and feed the worker processes with the data that has been read. If the input file is small, then it will probably be entirely in the cache all the time.
The input files are quite large in my opinion (more than 100 files of 330 MB each); I have posted my current solution with a diagram.

Now my processing diagram looks like this (see the linked diagram). My input files include one index file (about 400 MB) and 100 other files (each 330 MB, which can be considered a file pool). To generate one output file, the index file and all files within the file pool need to be read. (For example, if the first line of the index file is 15, then line 15 of each file within the file pool must be read to generate output file 1.) Previously I tried multiprocessing and threading without making copies, and the code was very slow. Then I optimized the code by making a copy of only the index file for each process, so each process reads its own copy of the index file individually and then reads the file pool to generate the output files. Currently, with 8 CPU cores, multiprocessing with pool size 8 takes the least time. A simplified sketch of the per-output read pattern follows.
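(In this sketch the paths, file layout, and combine step are assumptions for illustration only.)

import glob
import linecache

def build_output(job_id, index_copy, pool_dir="pool"):
    # Line `job_id` of this worker's private index copy holds the line
    # number to fetch from every file in the pool.
    with open(index_copy) as index_reader:
        for _ in range(job_id - 1):
            index_reader.readline()
        line_no = int(index_reader.readline())
    rows = []
    for path in sorted(glob.glob(f"{pool_dir}/*.dat")):
        # linecache caches whole files in memory; for 330 MB files a
        # seek-based index would scale better, this is only for clarity.
        rows.append(linecache.getline(path, line_no))
    with open(f"output_{job_id}.txt", "w") as out:
        out.writelines(rows)

# e.g. build_output(1, "index_0.dat") reads index line 1, then fetches
# that line from every pool file and writes output_1.txt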

4 Comments

This is what puzzles me: why do several copies of the same file perform better than a single copy, since a single copy, once read, will be cached in memory by the operating system? Are you opening the index file in read-only mode? Sorry for bothering you, but I am really wondering why; I might be missing something.
Yes, I use "with open(index_file) as index_reader" to read the index file, I think it is read only mode by default.
I don't know much about memory cache. Are you suggesting that if I use multiprocessing to read a single index file, more than one process is able to read the cached index file at the same time?
A bit late to answer. Yes, more than one process can read the same file at the same time. By memory cache, I mean that the operating system keeps the data that it reads from the disk in system memory for a while. If the same file is read another time, the data is transferred from system memory to the process directly, without hitting the disk. However, as more and more data is read from or written to the disk, the data cached in system memory may be discarded to make room for new data read from the disk. See en.wikipedia.org/wiki/Cache_(computing) - Disk Cache section.
