
I want to use multi-threading to get the longest common substring between each string in one set and every string in another. The following code works. HOWEVER, when I opened Task Manager, only around 40% of the CPU was utilized. Why? How can I maximize the CPU usage?

import difflib
import threading

def longest_substring(s, t, score, j):
    # Store the length of the longest common substring of s and t
    # at index j of the shared score list.
    match = difflib.SequenceMatcher(None, s, t).get_matching_blocks()
    char_num = []
    for i in match:
        char_num.append(i.size)
    score[j] = max(char_num)

# df and db are pandas DataFrames, each with an 'ocr' column;
# m and n are their respective row counts.
for i in range(m):

    score = [None]*n
    s = df.loc[i, 'ocr']

    # One thread per row of db; each thread writes its result into score[j].
    threads = [threading.Thread(target=longest_substring, args=(s, db.loc[j, 'ocr'], score, j)) for j in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
  • If you use threading you will only use one core of your CPU. Look at the multiprocessing module. Commented May 3, 2018 at 9:26

1 Answer


Parallel processing can be a little tricky; there are a few things to look at below:

First: Python's GIL (Global Interpreter Lock). The limited CPU usage you see comes down to only one core being utilised at a time: Python threads do not execute bytecode concurrently by default, because of Python's GIL. You can see the details here.

A global interpreter lock (GIL) is a mechanism used in computer-language interpreters to synchronize the execution of threads so that only one native thread can execute at a time. An interpreter that uses GIL always allows exactly one thread to execute at a time, even if run on a multi-core processor.

Applications running on implementations with a GIL can be designed to use separate processes to achieve full parallelism, as each process has its own interpreter and in turn has its own GIL. Otherwise, the GIL can be a significant barrier to parallelism.
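
You can see the effect directly with a rough timing sketch: two threads running a pure-Python, CPU-bound function take about as long as running it twice in a row, because only the thread holding the GIL makes progress:

import threading
import time

def burn():
    # Pure-Python CPU-bound loop; it holds the GIL while it runs.
    total = 0
    for _ in range(10_000_000):
        total += 1

start = time.perf_counter()
threads = [threading.Thread(target=burn) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(f"two threads: {time.perf_counter() - start:.2f}s")  # roughly 2x a single call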

To maximise your usage, go for multiprocessing in Python, which will distribute your tasks across the available cores and hence utilise the CPU fully.
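
As a minimal sketch of how that could look for your task: a multiprocessing.Pool lets the workers return their results directly, which also sidesteps the shared score list (a plain Python list is not shared across processes anyway; each process gets its own copy). The queries and corpus lists below are stand-ins for the df['ocr'] and db['ocr'] columns in the question:

import difflib
from multiprocessing import Pool

def longest_match_size(pair):
    # Worker: length of the longest common substring of the two strings.
    s, t = pair
    blocks = difflib.SequenceMatcher(None, s, t).get_matching_blocks()
    return max(block.size for block in blocks)

if __name__ == '__main__':
    # Stand-ins for df['ocr'] and db['ocr'] from the question.
    queries = ['hello world', 'foobar']
    corpus = ['yellow', 'world peace', 'raise the bar']

    with Pool() as pool:  # defaults to os.cpu_count() worker processes
        for s in queries:
            score = pool.map(longest_match_size, [(s, t) for t in corpus])
            print(s, score)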

Second: Your problem size. There is a trade-off between task size and CPU usage: if each unit of work is tiny, the overhead of spawning and scheduling workers dominates, so CPU usage stays low while total execution time grows. You can take control of this, and exploit all the CPU cores, by varying how much data each worker gets per task; measure to find your optimal value and the point at which scaling out actually pays off.
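
One concrete knob here, if you go the Pool route, is the chunksize argument of Pool.map: it batches many small tasks into fewer inter-process messages, so workers spend their time computing rather than communicating. A rough, self-contained sketch to see the effect (the trivial worker and the chunksize values are arbitrary placeholders; tune against your own data):

import time
from multiprocessing import Pool

def work(x):
    # Deliberately tiny task, so dispatch overhead dominates.
    return x * x

if __name__ == '__main__':
    with Pool() as pool:
        for chunksize in (1, 64, 1024):
            start = time.perf_counter()
            pool.map(work, range(200_000), chunksize=chunksize)
            print(chunksize, round(time.perf_counter() - start, 3))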


2 Comments

I tried threads = [multiprocessing.Process(target=longest_substring, args=(s, db.loc[j, 'ocr'], score, j)) for j in range(n)] but there is no increase in CPU usage and the execution time is even longer.
Seems OK to me, it should work; just start the processes, captain. :D There is always an execution-time trade-off between larger data chunks with less parallelism and smaller chunks with more parallelism. Parallelism only pays off when the data is big enough; otherwise the process start-up overhead outweighs the gain.
