
I am trying to write code to construct a DataFrame of cointegrating pairs of portfolios (i.e., the portfolio price series are cointegrated). The stocks in each portfolio are selected from the S&P 500 and are equally weighted.

Also, for economic reasons, the two portfolios in a pair must cover the same sectors.

For example: if the stocks in one portfolio are from the [IT] and [Financial] sectors, the second portfolio must also select its stocks from the [IT] and [Financial] sectors.

There is no single correct number of stocks per portfolio, so I'm considering about 10 to 20 stocks for each. However, the number of combinations is then on the order of (500 choose 10), so computation time becomes an issue.
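Just for scale, the size of that search space can be checked with the standard library (math.comb needs Python 3.8+):

import math

print(math.comb(500, 2))   # 124750 pairs
print(math.comb(500, 10))  # roughly 2.5e20 -- far too many to enumerate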

The following is my code:

import itertools
import time

import pandas as pd
import statsmodels.api as sm
import statsmodels.tsa.stattools as ts

def adf(x, y, xName, yName, pvalue=0.01, beta_lower=0.5, beta_upper=1):
    # Regress each series on the other, ADF-test the residuals, and keep
    # the direction with the smaller p-value if it passes the significance
    # and beta-range filters. (pd.ols was removed from pandas, so this uses
    # statsmodels OLS instead.)
    res = pd.DataFrame()
    regress1 = sm.OLS(y, sm.add_constant(x)).fit()  # y on x
    regress2 = sm.OLS(x, sm.add_constant(y)).fit()  # x on y
    error1, error2 = regress1.resid, regress2.resid
    test1, test2 = ts.adfuller(error1, 1), ts.adfuller(error2, 1)
    beta1, beta2 = regress1.params.iloc[1], regress2.params.iloc[1]
    if test1[1] < pvalue and test1[1] < test2[1] and beta_lower < beta1 < beta_upper:
        res[(tuple(xName), tuple(yName))] = pd.Series([beta1, test1[1]])
    elif test2[1] < pvalue and beta_lower < beta2 < beta_upper:
        res[(tuple(yName), tuple(xName))] = pd.Series([beta2, test2[1]])
    else:
        return None
    res = res.T
    res.columns = ["beta", "pvalue"]
    return res

def coint(dataFrame, nstocks=2, pvalue=0.01, beta_lower=0.5, beta_upper=1):
    # dataFrame = pandas DataFrame, in this case data['Adj Close']; rows = time, columns = tickers
    # pvalue = significance level of the ADF test
    # nstocks = number of stocks considered per ADF test (equal weight);
    #           if nstocks > 2, coint returns cointegration between portfolios
    # beta_lower = lower bound for the slope of the linear regression
    # beta_upper = upper bound for the slope of the linear regression

    a = time.time()
    tickers = dataFrame.columns
    tcomb = itertools.combinations(dataFrame.columns, nstocks)
    res = pd.DataFrame()
    sec = pd.DataFrame()
    for pair in tcomb:
        xName, yName = list(pair[:nstocks // 2]), list(pair[nstocks // 2:])
        # searchsorted returns integer positions, so use .iloc (.ix was removed from pandas)
        xind, yind = tickers.searchsorted(xName), tickers.searchsorted(yName)
        xSector = list(SNP.iloc[xind]["Sector"])
        ySector = list(SNP.iloc[yind]["Sector"])
        if set(xSector) == set(ySector):
            sector = [[(xSector, ySector)]]
            x, y = dataFrame[xName].sum(axis=1), dataFrame[yName].sum(axis=1)
            res1 = adf(x, y, xName, yName, pvalue, beta_lower, beta_upper)
            if res1 is None:
                continue
            if res.size == 0:
                res = res1
                sec = pd.DataFrame(sector, index=res.index, columns=["sector"])
            else:
                # DataFrame.append was removed in pandas 2.0; use pd.concat
                res = pd.concat([res, res1])
                sec = pd.concat([sec, pd.DataFrame(sector, index=[res.index[-1]], columns=["sector"])])
            print("added : ", pair)
    res = pd.concat([res, sec], axis=1)
    res = res.sort_values(by=["pvalue"], ascending=True)
    b = time.time()
    print("time taken : ", b - a, "sec")
    return res

With nstocks=2 this takes about 263 seconds, but as nstocks increases, the loop takes far longer (more than a day).

I collected the 'Adj Close' data from Yahoo Finance using pandas_datareader.data; the index is time and the columns are the tickers.
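Roughly, my setup looks like this (the tickers and dates below are just placeholders, and SNP is my DataFrame of S&P 500 constituents with a "Sector" column aligned to the sorted tickers; the Yahoo endpoint has historically been unstable):

import pandas_datareader.data as web

tickers = ["AAPL", "GS", "JPM", "MSFT"]  # placeholder subset of the S&P 500
data = web.DataReader(tickers, "yahoo", "2015-01-01", "2017-01-01")
prices = data["Adj Close"]  # rows = time, columns = tickers

result = coint(prices, nstocks=2)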

Any suggestions or help would be appreciated.

2 Answers


I don't know what computer you have, but I would advise using some kind of multiprocessing for the loop. I haven't looked really hard into your code, but as far as I can see, res and sec can be moved into shared-memory objects, and the individual loop iterations parallelized with multiprocessing.

If you have a decent CPU, it can improve performance 4-6 times. If you have access to some kind of HPC cluster, it can do wonders.
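A minimal sketch of the idea with multiprocessing.Pool. Rather than shared memory, this version simply returns results to the parent process, which is usually simpler; check_combination is a hypothetical stand-in for your sector check plus the adf() call:

import itertools
from multiprocessing import Pool

def check_combination(pair):
    # Placeholder worker: stands in for the sector check plus the ADF test
    # on the summed price series; return None for rejected combinations.
    xName, yName = pair[:len(pair) // 2], pair[len(pair) // 2:]
    return (xName, yName)  # replace with the real cointegration result

if __name__ == "__main__":
    tickers = ["AAPL", "GS", "JPM", "MSFT"]  # placeholder ticker list
    combos = itertools.combinations(tickers, 2)
    with Pool() as pool:  # one worker process per CPU core by default
        # chunksize batches combinations per process to reduce IPC overhead
        results = [r for r in pool.imap_unordered(check_combination, combos, chunksize=64)
                   if r is not None]
    print(len(results), "candidate pairs")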




I'd recommend using a profiler to narrow down the most time-consuming calls, and to check the number of loop iterations (does your loop make the expected number of passes?). Python 3 has a profiler in the standard library:

https://docs.python.org/3.6/library/profile.html

You can either invoke it in your code:

import cProfile
cProfile.run('your_function(inputs)')

Or, if a script is an easier entry point:

python -m cProfile [-o output_file] [-s sort_order] your-script.py
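To dig into the saved output afterwards, the standard library's pstats module works with either invocation (a small sketch; "profile.out" and your_function are placeholders):

import cProfile
import pstats

# Save stats to a file, then print the 20 slowest calls by cumulative time.
cProfile.run("your_function(inputs)", "profile.out")
stats = pstats.Stats("profile.out")
stats.sort_stats("cumulative").print_stats(20)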

