Correlation on Python

Question

I have the following dataframe:

    StockId Date        Value
    1       2015-01-02  -0.070012
    2       2015-01-02  -0.022447
    4       2015-01-02  -0.011474
    6       2015-01-02  0.003796
    13      2015-01-02  -0.032061
    ...
    355     2018-09-14  -0.035717
    356     2018-09-14  -0.007899
    357     2018-09-14  0.065217
    358     2018-09-14  0.063536
    359     2018-09-14  -0.023433

I'm looking to find the correlation between stocks over time in order to find the five stocks that are most correlated with stock 1. Is there a quick way to do this using pandas? Or does this require creating arrays and then calculating the correlations one by one? There are 359 stocks in the data frame.

Are your first few rows missing a column? The dataframe seems to jump from 3 to 4 columns. If so, can you update the column headers? Also, what is the shape of your dataframe?
– Cohan, Commented Jan 10, 2020 at 19:12
Apologies - no missing column, I am just omitting the indices. The shape is around 2555 x 3.
– user12685722, Commented Jan 10, 2020 at 19:24

Cohan · Accepted Answer · 2020-01-10 19:56:21Z

Assuming your dataframe is in a long format where each stock is valued once per day, you can use the pivot function to reshape it into a wide format. Specify Date as the index of the new dataframe and StockID as the columns. If you have data that is sampled more than once per day, pivot will raise an error about duplicate entries; use pivot_table instead and set its aggfunc argument to min, max, mean, or whatever else you deem appropriate for your application. If some stocks are sampled less than daily, you can still run the code, but the pivoted dataframe will contain NaN values, and df.corr() will compute each pairwise correlation only from the dates on which both stocks have a value.

Note: I'm only saying daily because that's what your table seems to imply.

From there you can use df.corr() to view the correlation matrix.

df = df.pivot(index='Date', columns='StockID')
df.columns = df.columns.droplevel()  # Convert multi-index to single index
print(df)
# StockID           a         b         c
# Date
# 1/10/2020  0.956625  0.175345  0.999375
# 1/11/2020  0.458859  0.714604  0.995440
# 1/12/2020  0.603331  0.881022  0.215262
# 1/13/2020  0.584198  0.303796  0.332117

matrix = df.corr()
print(matrix)
# StockID         a         b         c
# StockID                              
# a        1.000000 -0.680290  0.305365
# b       -0.680290  1.000000 -0.336229
# c        0.305365 -0.336229  1.000000
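
For data that is not exactly one value per stock per day, the reshaping step might look like the sketch below. It assumes the question's column names (StockId, Date, Value); the sample values are made up purely for illustration.

```python
import pandas as pd

# Made-up long-format data: stock 1 has two rows on 2015-01-02,
# and stock 2 has no row on 2015-01-05.
long_df = pd.DataFrame({
    "StockId": [1, 1, 2, 1, 1, 2],
    "Date": ["2015-01-02", "2015-01-02", "2015-01-02",
             "2015-01-05", "2015-01-06", "2015-01-06"],
    "Value": [0.01, 0.03, -0.02, 0.04, -0.01, 0.05],
})

# pivot_table averages duplicate (Date, StockId) pairs;
# pairs that never occur show up as NaN in the wide frame.
wide = long_df.pivot_table(index="Date", columns="StockId",
                           values="Value", aggfunc="mean")
print(wide)
```

Passing values="Value" also yields single-level columns, so the droplevel step is not needed in this variant.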

From there, you could iterate through each column of the matrix and sort its entries in descending order, which gives you a dict per stock ordered from the most positive correlation to the most negative.

for stock, corr in matrix.to_dict().items():
    corr = {
        k: v for k, v
        in sorted(corr.items(), key=lambda item: -item[1])
        if k != stock
    }
    print(stock, corr)
# a {'c': 0.30536503121224934, 'b': -0.6802897760166712}
# b {'c': -0.3362290204607999, 'a': -0.6802897760166712}
# c {'a': 0.30536503121224934, 'b': -0.3362290204607999}
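
If only one stock is of interest, as in the question, its column of the matrix can be selected and ranked directly instead of looping. The following is a minimal sketch with made-up values; the stock ids and n=2 are illustrative (the question would use stock 1 and n=5).

```python
import pandas as pd

# Made-up wide-format values: one column per stock id, one row per date.
wide = pd.DataFrame({
    1: [0.01, 0.02, -0.01, 0.03, 0.00],
    2: [0.02, 0.04, -0.02, 0.06, 0.00],    # exactly 2x stock 1
    4: [-0.01, -0.02, 0.01, -0.03, 0.00],  # exactly -1x stock 1
    6: [0.03, -0.01, 0.02, 0.01, -0.02],
})
matrix = wide.corr()

target = 1
# Take the target's column, drop its self-correlation, keep the n largest.
top = matrix[target].drop(target).nlargest(2)
print(top)
```

This ranks by signed correlation, so strongly negative stocks land at the bottom; calling .abs() before nlargest would rank by strength regardless of sign. Note that matrix[target] relies on the column labels being integers; if the ids were read in as strings, the label would be "1" instead.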

Or, if you want a more visual comparison,

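import matplotlib.pyplot as plt

plt.matshow(matrix)  # Draw the correlation matrix as a heatmap
plt.colorbar()       # Add a color scale for the correlation values
plt.show()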

edited Jan 10, 2020 at 19:56

answered Jan 10, 2020 at 19:23

Cohan

9 Comments

user12685722 Over a year ago

Sadly, I don't have each stock valued once per day. When I run the command I get: "None of ['StockID'] are in the columns". Some of them are valued each day, others are not.

Cohan Over a year ago

I noticed StockID in my code is StockId in your code. Are the stocks valued more frequently or less frequently than once per day? The pivot_table function has an aggfunc argument that could be used.

user12685722 Over a year ago

If I want to visualize only one stock from the correlation matrix, how do I substitute the print(stock, corr) command?

Cohan Over a year ago

That depends on how you want to visualize it. If you've already pivoted the data, you could do df[[stock_id]].plot(). Assuming you've imported matplotlib, you could then use plt.show().

user12685722 Over a year ago

Thanks! If I want to see only one stock (stock 1) from the correlation matrix (to see how it is correlated with the other stocks), how do I do that? matrix[[1]] is not working.
