
I have an in-memory dictionary with ~10,000 keys, where each key is a stock ticker and the value is a pandas DataFrame holding that ticker's daily-price time series. I am trying to calculate the pairwise Pearson correlation between tickers.

The code takes a long time (~3 hours) to iterate through all the combinations, which is O(n^2) work: C(10000, 2) ≈ 50 million pairs. I tried the multiprocessing.dummy package but saw no performance gain AT ALL (it actually got slower as the number of workers increased).

from multiprocessing.dummy import Pool
from scipy.stats import pearsonr

def calculate_correlation(pair):
    # correlate the two tickers' daily closing prices
    t1, t2 = pair
    return pearsonr(data[t1]['Close'], data[t2]['Close'])

todos = []

tickers = list(data.keys())
for idx, t1 in enumerate(tickers):
    for t2 in tickers[idx:]:  # only the upper triangle of the matrix
        todos.append((t1, t2))

pool = Pool(4)
results = pool.map(calculate_correlation, todos)
pool.close()
pool.join()

All the data has been loaded into memory, so the job should not be I/O-bound. Is there any reason why there is no performance gain at all?

1 Answer


When you use multiprocessing.dummy, you're using threads, not processes. For a CPU-bound application in Python, you are usually not going to get a performance boost from multi-threading. You should use multiprocessing instead to parallelize your code. So, if you change your code from

from multiprocessing.dummy import Pool 

to

from multiprocessing import Pool

this change alone should substantially improve your performance.
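Once you switch to processes, the worker function and its inputs must be picklable, and on start methods that spawn rather than fork (e.g. Windows) the module-level setup runs again in each worker. Here is a minimal sketch of the fixed version, not the asker's exact code (load_data is a hypothetical stand-in for however the data dict of DataFrames is actually built):

from multiprocessing import Pool
from itertools import combinations
from scipy.stats import pearsonr

data = load_data()  # hypothetical loader: maps ticker -> DataFrame

def calculate_correlation(pair):
    # runs in a worker process; `data` is inherited on fork,
    # or rebuilt by the module-level load_data() call on spawn
    t1, t2 = pair
    return pearsonr(data[t1]['Close'], data[t2]['Close'])

if __name__ == '__main__':
    # every unordered ticker pair exactly once (strict upper triangle)
    todos = list(combinations(data, 2))
    with Pool(4) as pool:
        # a large chunksize amortizes the pickling/IPC overhead
        # across the ~50 million small tasks
        results = pool.map(calculate_correlation, todos, chunksize=10_000)

Because each task is tiny and there are tens of millions of them, a generous chunksize matters: without it, the per-task inter-process messaging can swamp the actual computation.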

The above will fix your problem, but if you want to know why it happened, read on:

Python's Global Interpreter Lock (GIL) prevents two threads in the same process from executing Python bytecode at the same time. If your code did a lot of disk I/O, multi-threading would have helped, because a thread releases the GIL while it is blocked waiting on I/O. Likewise, if your Python code called into an external library or application that releases the GIL, threads could make progress in parallel during those calls. Multiprocessing, on the other hand, runs your workers as separate processes, each with its own interpreter and its own GIL. In a CPU-bound application such as yours, switching from multi-threading to multiprocessing lets the work run on several cores in parallel, which is what actually boosts performance.
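If you want to see the effect directly, here is a small self-contained experiment (illustrative only, not from the original post) that times the same pure-Python CPU-bound function on a thread pool versus a process pool:

import time
from multiprocessing import Pool
from multiprocessing.dummy import Pool as ThreadPool

def cpu_bound(n):
    # pure-Python arithmetic: the GIL is held the whole time
    total = 0
    for i in range(n):
        total += i * i
    return total

def timed(pool_cls, label):
    start = time.time()
    with pool_cls(4) as pool:
        pool.map(cpu_bound, [2_000_000] * 8)
    print('%s: %.2fs' % (label, time.time() - start))

if __name__ == '__main__':
    timed(ThreadPool, 'threads')   # roughly serial speed under the GIL
    timed(Pool, 'processes')       # roughly 4x faster on 4 physical cores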


1 Comment

That worked like a charm. You can also write Pool(multiprocessing.cpu_count()) so the code is platform-agnostic and uses as much computing power as possible. A small sketch of that suggestion follows below.
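Reusing the calculate_correlation and todos defined in the question:

import multiprocessing
from multiprocessing import Pool

# size the pool to however many cores the host machine has
with Pool(multiprocessing.cpu_count()) as pool:
    results = pool.map(calculate_correlation, todos)

(Note that Pool() with no argument already defaults to os.cpu_count().)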
