Ordinary Least Squares Regression for multiple columns in Pandas Dataframe

Question

I'm trying to find a way to run a linear regression over many columns, from A1 up to Z3. Here is a snippet of the dataframe, called df1:

    Time    A1      A2      A3      B1      B2      B3
1   1.00    6.64    6.82    6.79    6.70    6.95    7.02
2   2.00    6.70    6.86    6.92    NaN     NaN     NaN
3   3.00    NaN     NaN     NaN     7.07    7.27    7.40
4   4.00    7.15    7.26    7.26    7.19    NaN     NaN
5   5.00    NaN     NaN     NaN     NaN     7.40    7.51
6   5.50    7.44    7.63    7.58    7.54    NaN     NaN 
7   6.00    7.62    7.86    7.71    NaN     NaN     NaN

This code returns the slope coefficient of a linear regression for ONE column only and appends the value to a numpy array called series. Here is what it looks like for extracting the slope for the first column:

import numpy as np
from sklearn.linear_model import LinearRegression

series = np.array([])  # empty array to collect the slopes

df2 = df1[~np.isnan(df1['A1'])]  # drop the rows where this column is NaN so sklearn can fit
df3 = df2[['Time', 'A1']]
npMatrix = df3.to_numpy()
X, Y = npMatrix[:, [0]], npMatrix[:, [1]]  # list indexing keeps both arrays two-dimensional
slope = LinearRegression().fit(X, Y)
m = slope.coef_[0]

series = np.concatenate((series, m), axis=0)

As it stands now, I am using this snippet and replacing "A1" with a new column name each time, all the way up to "Z3", which is extremely inefficient. I know there are many easy ways to do this with some modules, but I have the drawback of all these intermediate NaN values in the time series, so it seems like I'm limited to this method, or something like it.

I tried using a for loop such as:

for col in df1.columns:

and replacing 'A1', for example with col in the code, but this does not seem to be working.

Is there any way I can do this more efficiently?

Thank you!

piRSquared · Accepted Answer · 2016-07-16 09:59:16Z

9

One liner (or three)

time = df[['Time']]
pd.DataFrame(np.linalg.pinv(time.T.dot(time)).dot(time.T).dot(df.fillna(0)),
             ['Slope'], df.columns)

Broken down with a bit of explanation

Using the closed form of OLS, $\hat{\beta} = (X^T X)^{-1} X^T Y$

In this case X is time where we define time as df[['Time']]. I used the double brackets to preserve the dataframe and its two dimensions. If I'd done single brackets, I'd have gotten a series and its one dimension. Then the dot products aren't as pretty.
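To make the bracket point concrete, here is a minimal sketch on a made-up two-row frame (the values and the names `as_series`, `as_frame`, `xtx_frame`, `xtx_series` are only for illustration):

```python
import pandas as pd

# Hypothetical two-row frame, just to show the shapes.
df = pd.DataFrame({"Time": [1.0, 2.0], "A1": [6.64, 6.70]})

as_series = df["Time"]    # single brackets -> Series, one dimension
as_frame = df[["Time"]]   # double brackets -> DataFrame, two dimensions

print(as_series.shape)
print(as_frame.shape)

# With the DataFrame, X'X stays a labelled 1x1 table;
# with the Series, the same product collapses to a bare scalar.
xtx_frame = as_frame.T.dot(as_frame)
xtx_series = as_series.dot(as_series)
```

Keeping everything two-dimensional is what lets the later `.dot(...)` calls chain without any reshaping.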

$(X^T X)^{-1} X^T$ is np.linalg.pinv(time.T.dot(time)).dot(time.T)

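Y is df.fillna(0). Yes, we could have done one column at a time, but why, when we could do it all together? You have to deal with the NaNs. How would you imagine dealing with them? Only doing it over the times you had data? Placing zeroes in the NaN spots makes those rows contribute nothing to the $X^T Y$ product, so I did that. Be aware that the zero-filled rows still count in $X^T X$, and that this model has no intercept term, so the numbers will differ from a per-column fit over the observed rows only.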

Finally, I use pd.DataFrame(stuff, ['Slope'], df.columns) to contain all slopes in one place with the original labels.

Note that I calculated the slope of the regression for Time against itself. Why not? It was there. Its value is 1.0. Great! I probably did it right!
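It is worth checking what the zero-fill does numerically. With a single Time column and no constant, the closed form reduces per column to sum(t*y) / sum(t^2). The sketch below is not part of the original answer: it uses two columns from the question's sample, and the helper name `slopes` is made up. It compares that value with the same no-intercept fit restricted to the observed rows, and with an ordinary slope-plus-intercept fit from `np.polyfit`.

```python
import numpy as np
import pandas as pd

nan = np.nan
# Two columns taken from the question's sample data.
df = pd.DataFrame({
    "Time": [1.0, 2.0, 3.0, 4.0, 5.0, 5.5, 6.0],
    "A1": [6.64, 6.70, nan, 7.15, nan, 7.44, 7.62],
    "B1": [6.70, nan, 7.07, 7.19, nan, 7.54, nan],
})

t = df["Time"].to_numpy()

def slopes(col):
    """Return three slope estimates for one column (illustrative helper)."""
    y = df[col].to_numpy()
    seen = ~np.isnan(y)
    # Closed form with NaN replaced by 0: every row counts in sum(t^2).
    zero_filled = (t * np.nan_to_num(y)).sum() / (t ** 2).sum()
    # Same no-intercept model, but only over rows where y was observed.
    origin_masked = (t[seen] * y[seen]).sum() / (t[seen] ** 2).sum()
    # Ordinary fit y = m*t + b over observed rows; polyfit returns [m, b].
    with_intercept = np.polyfit(t[seen], y[seen], 1)[0]
    return zero_filled, origin_masked, with_intercept

for col in ["A1", "B1"]:
    print(col, slopes(col))
```

On this sample the three numbers differ noticeably, so which one you want depends on whether the line should pass through the origin and whether missing rows should be dropped or treated as zeros.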

1 Comment

RageQuilt Over a year ago

Works great. Thanks a million!

Nat Wilson · Accepted Answer · 2016-07-16 05:03:28Z

1

Looping is a decent strategy for a modest number (say, fewer than thousands) of columns. Without seeing your implementation, I can't say what's wrong, but here's my version, which works:

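import numpy as np
from sklearn.linear_model import LinearRegression

slopes = []
cols = df1.columns

for c in cols:
    if c == "Time":
        continue  # skip the regressor itself; a break here would stop at the first column
    mask = ~np.isnan(df1[c])  # keep only the rows where this column has data
    x = np.atleast_2d(df1.Time[mask].values).T
    y = np.atleast_2d(df1[c][mask].values).T
    reg = LinearRegression().fit(x, y)
    slopes.append(reg.coef_[0])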

I've simplified your code a bit to avoid creating so many temporary DataFrame objects, but it should work fine your way too.
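If you'd rather get the slopes back with their column labels, something like the following should work. It is a sketch rather than a drop-in: it swaps `LinearRegression` for `np.polyfit` so that it only needs numpy and pandas, it uses a cut-down copy of the question's data, and the helper `column_slope` is a name I made up.

```python
import numpy as np
import pandas as pd

nan = np.nan
# Cut-down copy of the question's df1.
df1 = pd.DataFrame({
    "Time": [1.0, 2.0, 3.0, 4.0, 5.0, 5.5, 6.0],
    "A1": [6.64, 6.70, nan, 7.15, nan, 7.44, 7.62],
    "B2": [6.95, nan, 7.27, nan, 7.40, nan, nan],
})

def column_slope(frame, col, time_col="Time"):
    """Slope of col against time_col, using only rows where col is present."""
    mask = frame[col].notna()
    x = frame.loc[mask, time_col].to_numpy()
    y = frame.loc[mask, col].to_numpy()
    # Degree-1 polyfit returns [slope, intercept].
    slope, _intercept = np.polyfit(x, y, 1)
    return slope

# One entry per data column, indexed by the column name.
slopes = pd.Series({c: column_slope(df1, c) for c in df1.columns if c != "Time"})
print(slopes)
```

The Series index keeps each slope tied to its column, which is easier to work with than a bare list once you have A1 through Z3.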

1 Comment

RageQuilt Over a year ago

Ahh this was driving me crazy, thanks for the quick fix!
