
I am trying to write code to construct a DataFrame of cointegrating pairs of portfolios (i.e., the portfolio price series are cointegrated). The stocks in each portfolio are selected from the S&P 500 and are equally weighted.

Also, for economic reasons, the two portfolios in a pair must cover the same sectors.

For example: if the stocks in one portfolio are from the [IT] and [Financial] sectors, the second portfolio must also select its stocks from the [IT] and [Financial] sectors.

There is no single correct number of stocks per portfolio, so I'm considering about 10 to 20 stocks for each. However, the number of combinations is then on the order of (500 choose 10), so computation time becomes an issue.
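Just for scale, the size of that search space can be checked with the standard library (math.comb needs Python 3.8+):

import math

print(math.comb(500, 2))   # 124750 pairs
print(math.comb(500, 10))  # roughly 2.5e20 -- far too many to enumerate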

The following is my code:

import itertools
import time

import pandas as pd
import statsmodels.api as sm
import statsmodels.tsa.stattools as ts

def adf(x, y, xName, yName, pvalue=0.01, beta_lower=0.5, beta_upper=1):
    # Regress each series on the other, ADF-test the residuals, and keep
    # the direction with the smaller p-value if it passes the significance
    # and beta-range filters. (pd.ols was removed from pandas, so this uses
    # statsmodels OLS instead.)
    res = pd.DataFrame()
    regress1 = sm.OLS(y, sm.add_constant(x)).fit()  # y on x
    regress2 = sm.OLS(x, sm.add_constant(y)).fit()  # x on y
    error1, error2 = regress1.resid, regress2.resid
    test1, test2 = ts.adfuller(error1, 1), ts.adfuller(error2, 1)
    beta1, beta2 = regress1.params.iloc[1], regress2.params.iloc[1]
    if test1[1] < pvalue and test1[1] < test2[1] and beta_lower < beta1 < beta_upper:
        res[(tuple(xName), tuple(yName))] = pd.Series([beta1, test1[1]])
    elif test2[1] < pvalue and beta_lower < beta2 < beta_upper:
        res[(tuple(yName), tuple(xName))] = pd.Series([beta2, test2[1]])
    else:
        return None
    res = res.T
    res.columns = ["beta", "pvalue"]
    return res

def coint(dataFrame, nstocks=2, pvalue=0.01, beta_lower=0.5, beta_upper=1):
    # dataFrame = pandas DataFrame, in this case data['Adj Close']; rows = time, columns = tickers
    # pvalue = significance level of the ADF test
    # nstocks = number of stocks considered per ADF test (equal weight);
    #           if nstocks > 2, coint returns cointegration between portfolios
    # beta_lower = lower bound for the slope of the linear regression
    # beta_upper = upper bound for the slope of the linear regression

    a = time.time()
    tickers = dataFrame.columns
    tcomb = itertools.combinations(dataFrame.columns, nstocks)
    res = pd.DataFrame()
    sec = pd.DataFrame()
    for pair in tcomb:
        xName, yName = list(pair[:nstocks // 2]), list(pair[nstocks // 2:])
        # searchsorted returns integer positions, so use .iloc (.ix was removed from pandas)
        xind, yind = tickers.searchsorted(xName), tickers.searchsorted(yName)
        xSector = list(SNP.iloc[xind]["Sector"])
        ySector = list(SNP.iloc[yind]["Sector"])
        if set(xSector) == set(ySector):
            sector = [[(xSector, ySector)]]
            x, y = dataFrame[xName].sum(axis=1), dataFrame[yName].sum(axis=1)
            res1 = adf(x, y, xName, yName, pvalue, beta_lower, beta_upper)
            if res1 is None:
                continue
            if res.size == 0:
                res = res1
                sec = pd.DataFrame(sector, index=res.index, columns=["sector"])
            else:
                # DataFrame.append was removed in pandas 2.0; use pd.concat
                res = pd.concat([res, res1])
                sec = pd.concat([sec, pd.DataFrame(sector, index=[res.index[-1]], columns=["sector"])])
            print("added : ", pair)
    res = pd.concat([res, sec], axis=1)
    res = res.sort_values(by=["pvalue"], ascending=True)
    b = time.time()
    print("time taken : ", b - a, "sec")
    return res

With nstocks=2 this takes about 263 seconds, but as nstocks increases, the loop takes far longer (more than a day).

I collected the 'Adj Close' data from Yahoo Finance using pandas_datareader.data; the index is time and the columns are the tickers.
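Roughly, my setup looks like this (the tickers and dates below are just placeholders, and SNP is my DataFrame of S&P 500 constituents with a "Sector" column aligned to the sorted tickers; the Yahoo endpoint has historically been unstable):

import pandas_datareader.data as web

tickers = ["AAPL", "GS", "JPM", "MSFT"]  # placeholder subset of the S&P 500
data = web.DataReader(tickers, "yahoo", "2015-01-01", "2017-01-01")
prices = data["Adj Close"]  # rows = time, columns = tickers

result = coint(prices, nstocks=2)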

Any suggestions or help would be appreciated.

2 Answers


I don't know what computer you have, but I would advise using some kind of multiprocessing for the loop. I haven't looked really hard into your code, but as far as I can see, res and sec can be moved into shared-memory objects, and the individual loop iterations parallelized with multiprocessing.

If you have a decent CPU, it can improve performance 4-6 times. If you have access to some kind of HPC cluster, it can do wonders.
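A minimal sketch of the idea with multiprocessing.Pool. Rather than shared memory, this version simply returns results to the parent process, which is usually simpler; check_combination is a hypothetical stand-in for your sector check plus the adf() call:

import itertools
from multiprocessing import Pool

def check_combination(pair):
    # Placeholder worker: stands in for the sector check plus the ADF test
    # on the summed price series; return None for rejected combinations.
    xName, yName = pair[:len(pair) // 2], pair[len(pair) // 2:]
    return (xName, yName)  # replace with the real cointegration result

if __name__ == "__main__":
    tickers = ["AAPL", "GS", "JPM", "MSFT"]  # placeholder ticker list
    combos = itertools.combinations(tickers, 2)
    with Pool() as pool:  # one worker process per CPU core by default
        # chunksize batches combinations per process to reduce IPC overhead
        results = [r for r in pool.imap_unordered(check_combination, combos, chunksize=64)
                   if r is not None]
    print(len(results), "candidate pairs")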




I'd recommend using a profiler to narrow down the most time-consuming calls, and to check the number of loop iterations (does your loop make the expected number of passes?). Python 3 has a profiler in the standard library:

https://docs.python.org/3.6/library/profile.html

You can either invoke it in your code:

import cProfile
cProfile.run('your_function(inputs)')

Or, if a script is an easier entry point:

python -m cProfile [-o output_file] [-s sort_order] your-script.py
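To dig into the saved output afterwards, the standard library's pstats module works with either invocation (a small sketch; "profile.out" and your_function are placeholders):

import cProfile
import pstats

# Save stats to a file, then print the 20 slowest calls by cumulative time.
cProfile.run("your_function(inputs)", "profile.out")
stats = pstats.Stats("profile.out")
stats.sort_stats("cumulative").print_stats(20)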

