
I want to use multi-threading to get the longest common substring between each string in one set and every string in another. The following code works. HOWEVER, when I opened Task Manager, only around 40% of the CPU was utilized. Why? How can I maximize the CPU usage?

import difflib
import threading

def longest_substring(s, t, score, j):
    # Store the length of the longest common substring of s and t
    # at index j of the shared score list.
    match = difflib.SequenceMatcher(None, s, t).get_matching_blocks()
    char_num = []
    for i in match:
        char_num.append(i.size)
    score[j] = max(char_num)

# df and db are pandas DataFrames, each with an 'ocr' column;
# m and n are their respective row counts.
for i in range(m):

    score = [None]*n
    s = df.loc[i, 'ocr']

    # One thread per row of db; each thread writes its result into score[j].
    threads = [threading.Thread(target=longest_substring, args=(s, db.loc[j, 'ocr'], score, j)) for j in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
  • If you use threading you will only use one core of your CPU. Look at the multiprocessing module. Commented May 3, 2018 at 9:26

1 Answer


Parallel processing can be a little tricky; there are a few things to look at below:

First: Python's GIL (Global Interpreter Lock). The limited CPU usage you see comes down to only one core being utilised at a time: Python threads do not execute bytecode concurrently by default, because of Python's GIL. You can see the details here.

A global interpreter lock (GIL) is a mechanism used in computer-language interpreters to synchronize the execution of threads so that only one native thread can execute at a time. An interpreter that uses GIL always allows exactly one thread to execute at a time, even if run on a multi-core processor.

Applications running on implementations with a GIL can be designed to use separate processes to achieve full parallelism, as each process has its own interpreter and in turn has its own GIL. Otherwise, the GIL can be a significant barrier to parallelism.
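
You can see the effect directly with a rough timing sketch: two threads running a pure-Python, CPU-bound function take about as long as running it twice in a row, because only the thread holding the GIL makes progress:

import threading
import time

def burn():
    # Pure-Python CPU-bound loop; it holds the GIL while it runs.
    total = 0
    for _ in range(10_000_000):
        total += 1

start = time.perf_counter()
threads = [threading.Thread(target=burn) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(f"two threads: {time.perf_counter() - start:.2f}s")  # roughly 2x a single call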

To maximise your usage, go for multiprocessing in Python, which will distribute your tasks across the available cores and hence utilise the CPU fully.
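
As a minimal sketch of how that could look for your task: a multiprocessing.Pool lets the workers return their results directly, which also sidesteps the shared score list (a plain Python list is not shared across processes anyway; each process gets its own copy). The queries and corpus lists below are stand-ins for the df['ocr'] and db['ocr'] columns in the question:

import difflib
from multiprocessing import Pool

def longest_match_size(pair):
    # Worker: length of the longest common substring of the two strings.
    s, t = pair
    blocks = difflib.SequenceMatcher(None, s, t).get_matching_blocks()
    return max(block.size for block in blocks)

if __name__ == '__main__':
    # Stand-ins for df['ocr'] and db['ocr'] from the question.
    queries = ['hello world', 'foobar']
    corpus = ['yellow', 'world peace', 'raise the bar']

    with Pool() as pool:  # defaults to os.cpu_count() worker processes
        for s in queries:
            score = pool.map(longest_match_size, [(s, t) for t in corpus])
            print(s, score)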

Second: Your problem size. There is a trade-off between task size and CPU usage: if each unit of work is tiny, the overhead of spawning and scheduling workers dominates, so CPU usage stays low while total execution time grows. You can take control of this, and exploit all the CPU cores, by varying how much data each worker gets per task; measure to find your optimal value and the point at which scaling out actually pays off.
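
One concrete knob here, if you go the Pool route, is the chunksize argument of Pool.map: it batches many small tasks into fewer inter-process messages, so workers spend their time computing rather than communicating. A rough, self-contained sketch to see the effect (the trivial worker and the chunksize values are arbitrary placeholders; tune against your own data):

import time
from multiprocessing import Pool

def work(x):
    # Deliberately tiny task, so dispatch overhead dominates.
    return x * x

if __name__ == '__main__':
    with Pool() as pool:
        for chunksize in (1, 64, 1024):
            start = time.perf_counter()
            pool.map(work, range(200_000), chunksize=chunksize)
            print(chunksize, round(time.perf_counter() - start, 3))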


2 Comments

I tried threads = [multiprocessing.Process(target=longest_substring, args=(s, db.loc[j, 'ocr'], score, j)) for j in range(n)] but there is no increase in CPU usage and the execution time is even longer.
Seems OK to me, it should work; just start the processes, captain. :D There is always an execution-time trade-off between larger data chunks with less parallelism and smaller chunks with more parallelism. Parallelism only pays off when the data is big enough; otherwise the process start-up overhead outweighs the gain.
