I have an in-memory dictionary `data` with ~10,000 keys, where each key is a stock ticker and each value is a pandas DataFrame holding that ticker's daily price time series. I am trying to calculate the pairwise Pearson correlation between the tickers' closing prices.
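For context, the layout looks roughly like this (a toy sketch; the ticker names and prices are made up, and the real DataFrames have years of daily rows):

import pandas as pd

# dict of ticker -> DataFrame with a 'Close' column (toy data, 2 of ~10,000 tickers)
data = {
    'AAPL': pd.DataFrame({'Close': [150.1, 151.3, 149.8]}),
    'MSFT': pd.DataFrame({'Close': [310.2, 312.5, 309.9]}),
}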
The code takes a long time (~3 hours) to iterate through all C(10000, 2) ≈ 50 million pairs. I tried the multiprocessing.dummy package but saw no performance gain at all (it actually gets slower as the number of workers increases).
from multiprocessing.dummy import Pool
from scipy.stats import pearsonr

def calculate_correlation(pair):
    # correlate the two tickers' closing-price series
    t1, t2 = pair
    return pearsonr(data[t1]['Close'], data[t2]['Close'])

todos = []
tickers = list(data.keys())
for idx, t1 in enumerate(tickers):
    for t2 in tickers[idx:]:  # only the upper triangle of the matrix
        todos.append((t1, t2))

pool = Pool(4)
results = pool.map(calculate_correlation, todos)
pool.close()
pool.join()
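For what it's worth, this is roughly how I compare worker counts (a sketch; the specific counts below are just illustrative values, not an exact record of my runs):

import time

for workers in (1, 2, 4, 8):  # illustrative worker counts
    pool = Pool(workers)
    start = time.perf_counter()
    results = pool.map(calculate_correlation, todos)
    pool.close()
    pool.join()
    print(workers, 'workers:', time.perf_counter() - start, 'seconds')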
All the data is already loaded into memory, so the work should not be I/O bound. Is there any reason why there is no performance gain at all?