
I have an in-memory dictionary with ~10,000 keys, where each key is a stock ticker and the value is a pandas DataFrame holding that ticker's daily-price time series. I am trying to calculate the pairwise Pearson correlation between tickers.

The code takes a long time (~3 hours) to iterate through all the combinations, which is O(n^2) work: C(10000, 2) ≈ 50 million pairs. I tried the multiprocessing.dummy package but saw no performance gain AT ALL (it actually got slower as the number of workers increased).

from multiprocessing.dummy import Pool
from scipy.stats import pearsonr

def calculate_correlation(pair):
    # correlate the two tickers' daily closing prices
    t1, t2 = pair
    return pearsonr(data[t1]['Close'], data[t2]['Close'])

todos = []

tickers = list(data.keys())
for idx, t1 in enumerate(tickers):
    for t2 in tickers[idx:]:  # only the upper triangle of the matrix
        todos.append((t1, t2))

pool = Pool(4)
results = pool.map(calculate_correlation, todos)
pool.close()
pool.join()

All the data has been loaded into memory, so the job should not be I/O-bound. Is there any reason why there is no performance gain at all?

1 Answer


When you use multiprocessing.dummy, you're using threads, not processes. For a CPU-bound application in Python, you are usually not going to get a performance boost from multi-threading. You should use multiprocessing instead to parallelize your code. So, if you change your code from

from multiprocessing.dummy import Pool 

to

from multiprocessing import Pool

this change alone should substantially improve your performance.
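Once you switch to processes, the worker function and its inputs must be picklable, and on start methods that spawn rather than fork (e.g. Windows) the module-level setup runs again in each worker. Here is a minimal sketch of the fixed version, not the asker's exact code (load_data is a hypothetical stand-in for however the data dict of DataFrames is actually built):

from multiprocessing import Pool
from itertools import combinations
from scipy.stats import pearsonr

data = load_data()  # hypothetical loader: maps ticker -> DataFrame

def calculate_correlation(pair):
    # runs in a worker process; `data` is inherited on fork,
    # or rebuilt by the module-level load_data() call on spawn
    t1, t2 = pair
    return pearsonr(data[t1]['Close'], data[t2]['Close'])

if __name__ == '__main__':
    # every unordered ticker pair exactly once (strict upper triangle)
    todos = list(combinations(data, 2))
    with Pool(4) as pool:
        # a large chunksize amortizes the pickling/IPC overhead
        # across the ~50 million small tasks
        results = pool.map(calculate_correlation, todos, chunksize=10_000)

Because each task is tiny and there are tens of millions of them, a generous chunksize matters: without it, the per-task inter-process messaging can swamp the actual computation.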

The above will fix your problem, but if you want to know why it happened, read on:

Python's Global Interpreter Lock (GIL) prevents two threads in the same process from executing Python bytecode at the same time. If your code did a lot of disk I/O, multi-threading would have helped, because a thread releases the GIL while it is blocked waiting on I/O. Likewise, if your Python code called into an external library or application that releases the GIL, threads could make progress in parallel during those calls. Multiprocessing, on the other hand, runs your workers as separate processes, each with its own interpreter and its own GIL. In a CPU-bound application such as yours, switching from multi-threading to multiprocessing lets the work run on several cores in parallel, which is what actually boosts performance.
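If you want to see the effect directly, here is a small self-contained experiment (illustrative only, not from the original post) that times the same pure-Python CPU-bound function on a thread pool versus a process pool:

import time
from multiprocessing import Pool
from multiprocessing.dummy import Pool as ThreadPool

def cpu_bound(n):
    # pure-Python arithmetic: the GIL is held the whole time
    total = 0
    for i in range(n):
        total += i * i
    return total

def timed(pool_cls, label):
    start = time.time()
    with pool_cls(4) as pool:
        pool.map(cpu_bound, [2_000_000] * 8)
    print('%s: %.2fs' % (label, time.time() - start))

if __name__ == '__main__':
    timed(ThreadPool, 'threads')   # roughly serial speed under the GIL
    timed(Pool, 'processes')       # roughly 4x faster on 4 physical cores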


1 Comment

That worked like a charm. You can also write Pool(multiprocessing.cpu_count()) so the code is platform-agnostic and uses as much computing power as possible. A small sketch of that suggestion follows below.
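Reusing the calculate_correlation and todos defined in the question:

import multiprocessing
from multiprocessing import Pool

# size the pool to however many cores the host machine has
with Pool(multiprocessing.cpu_count()) as pool:
    results = pool.map(calculate_correlation, todos)

(Note that Pool() with no argument already defaults to os.cpu_count().)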
