
I am learning multiprocessing and threading in Python to process and create a large number of files; the workflow is shown in the linked diagram.

Each output file depends on the analysis of all input files.

Single-process execution of the program takes quite a long time, so I tried the following code:

(a) multiprocessing

import time
from multiprocessing import Pool, cpu_count

start = time.time()
process_count = cpu_count()
p = Pool(process_count)
for i in range(process_count):
    # each task reads all input files and writes output file i
    p.apply_async(my_read_process_and_write_func, args=(i,))

p.close()
p.join()
end = time.time()

(b) threading

import time
import threading
from multiprocessing import cpu_count

start = time.time()
thread_count = cpu_count()
thread_list = []

for i in range(thread_count):
    t = threading.Thread(target=my_read_process_and_write_func, args=(i,))
    thread_list.append(t)

# start all threads, then wait for all of them to finish
for t in thread_list:
    t.start()

for t in thread_list:
    t.join()

end = time.time()

I am running this code using Python 3.6 on a Windows PC with 8 cores. However, the multiprocessing method takes about the same time as the single-process method, and the threading method takes about 75% of the single-process time.

My questions are:

Is my code correct?

Is there a better way/code to improve the efficiency? Thanks!

3 Answers


Your processing is I/O bound, not CPU bound, so having multiple processes helps little: each Python process sits waiting for input or output while the CPU does nothing. Increasing the Pool size in multiprocessing should improve performance.
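As a rough sketch of that suggestion (assuming the my_read_process_and_write_func from the question, and an oversubscription factor of 4 that is only a starting point for tuning), an I/O-bound pool can be made larger than the core count so that workers blocked on disk reads leave the CPU free for others:

import time
from multiprocessing import Pool, cpu_count

# my_read_process_and_write_func is the worker from the question,
# assumed to be defined or imported here.

if __name__ == "__main__":  # guard is required on Windows
    start = time.time()
    # Oversubscribe the pool: while some workers block on disk I/O,
    # others can run. The factor of 4 is an assumption to tune.
    with Pool(processes=cpu_count() * 4) as pool:
        pool.map(my_read_process_and_write_func, range(8))  # 8 output files
    end = time.time()
    print(f"elapsed: {end - start:.1f} s")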


2 Comments

Thanks Tarik, your answer helped a lot!
Answer accepted, I have provided my current solution below. Please enlighten me if there are better methods, thanks.

Following Tarik's answer, since my processing is I/O bound, I made several copies of the input files, so that each process reads and processes a different copy. Now my code runs 8 times faster. A minimal sketch of the approach is below.
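(In this sketch the file names and the copy step are assumptions; the worker is the one from the question.)

import shutil
from multiprocessing import Pool, cpu_count

# my_read_process_and_write_func is the question's worker, assumed to
# open "index_{i}.dat" for job i. All file names here are hypothetical.

def make_copies(src="index.dat", n=8):
    # One private copy of the input per worker, so processes do not
    # contend on the same file handle.
    for i in range(n):
        shutil.copy(src, f"index_{i}.dat")

if __name__ == "__main__":
    n = cpu_count()
    make_copies(src="index.dat", n=n)
    with Pool(n) as pool:
        pool.map(my_read_process_and_write_func, range(n))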

4 Comments

You mean you have identical copies of the input files? Are you opening the input files in read only mode?
Yes. I tried to increase the Pool size in multiprocessing, but there was no big difference. Since my task is I/O bound, I copied the input files, and each process reads its own copy of the input files to generate different output files.
You got me curious with "I made several copies of input files", because it should not have any effect: once read by the operating system, the file blocks are cached in system memory, unless of course the input file is so large that it cannot be cached. In the case where multiple processes are processing the same large input file, I would have a main process read the file in sequence and feed the worker processes with the data that has been read. If the input file is small, then it will probably be entirely in the cache all the time.
The input files are quite large in my opinion (more than 100 files of 330 MB each); I have posted my current solution with a diagram.

Now my processing diagram looks like this (see the linked diagram). My input files include one index file (about 400 MB) and 100 other files (each 330 MB, which can be considered a file pool). To generate one output file, the index file and all files within the file pool need to be read. (For example, if the first line of the index file is 15, then line 15 of each file within the file pool must be read to generate output file 1.) Previously I tried multiprocessing and threading without making copies, and the code was very slow. Then I optimized the code by making a copy of only the index file for each process, so each process reads its own copy of the index file individually and then reads the file pool to generate the output files. Currently, with 8 CPU cores, multiprocessing with pool size 8 takes the least time. A simplified sketch of the per-output read pattern follows.
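(In this sketch the paths, file layout, and combine step are assumptions for illustration only.)

import glob
import linecache

def build_output(job_id, index_copy, pool_dir="pool"):
    # Line `job_id` of this worker's private index copy holds the line
    # number to fetch from every file in the pool.
    with open(index_copy) as index_reader:
        for _ in range(job_id - 1):
            index_reader.readline()
        line_no = int(index_reader.readline())
    rows = []
    for path in sorted(glob.glob(f"{pool_dir}/*.dat")):
        # linecache caches whole files in memory; for 330 MB files a
        # seek-based index would scale better, this is only for clarity.
        rows.append(linecache.getline(path, line_no))
    with open(f"output_{job_id}.txt", "w") as out:
        out.writelines(rows)

# e.g. build_output(1, "index_0.dat") reads index line 1, then fetches
# that line from every pool file and writes output_1.txt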

4 Comments

This is what puzzles me: why do several copies of the same file perform better than a single copy, since a single copy, once read, will be cached in memory by the operating system? Are you opening the index file in read-only mode? Sorry for bothering you, but I am really wondering why; I might be missing something.
Yes, I use "with open(index_file) as index_reader" to read the index file, I think it is read only mode by default.
I don't know much about memory cache. Are you suggesting that if I use multiprocessing to read a single index file, more than one process is able to read the cached index file at the same time?
A bit late to answer. Yes, more than one process can read the same file at the same time. By memory cache, I mean that the operating system keeps the data that it reads from the disk in system memory for a while. If the same file is read another time, the data is transferred from system memory to the process directly, without hitting the disk. However, as more and more data is read from or written to the disk, the data cached in system memory may be discarded to make room for new data read from the disk. See en.wikipedia.org/wiki/Cache_(computing) - Disk Cache section.
