Linear Regression over two variables in a pandas dataframe

Question

I am not really a programmer, but I need to work out a relationship between one quantity and two variables. I have been googling extensively, but I can't figure out how to get my data into sklearn's linear_model.

I have a dataframe defined thus:

import pandas as pd

I = [-2, 0, 5, 10, 15, 20, 25, 30]
d = {27.11 : [9.01,8.555,7.56,6.77,6.14,5.63,5.17,4.74],
     28.91 : [8.89,8.43,7.46,6.69,6.07,5.56,5.12,4.68],
     30.72 : [8.76,8.32,7.36,6.60,6.00,5.50,5.06,4.69],
     32.52 : [8.64,8.20,7.26,6.52,5.93,5.44,5.00,4.58],
     34.33 : [8.52,8.08,7.16,6.44,5.86,5.38,4.95,4.52],
     36.11 : [8.39,7.97,7.07,6.35,5.79,5.31,4.86,4.46]}
oxy = pd.DataFrame(index = I, data = d) # temp, salinity to oxygenation ml/L

The index represents temperature and the column names represent salinity. I need a way to predict oxygenation (the values in the columns) from temperature and salinity.

I think my issue is mostly syntax related.

I have tried fitting my data with:

X = [list(oxy.columns.values),list(oxy.index.values)]
regr = linear_model.LinearRegression()
regr.fit(X,oxy)

along with lots of variants that try to associate the value at each (index, column) position in the dataframe with the corresponding X. I am really just not figuring out how to do this.

I found lots of guides on regression with two variables, but they all had flat datasets, and I don't know how to flatten this one without lots and lots of typing.

So my question is either: is there a way to run a regression on two variables where my independent variables are the index and column values of a pandas dataframe? Or: is there a quick and efficient way to flatten this dataframe into a 48 by 3 dataframe, so that one of the many guides I've found will actually help me?

Thank you in advance.

Joe Patten · Accepted Answer · 2019-02-18 20:42:00Z

3

You can use stack to reshape the data, and then rename the columns:

oxy2 = oxy.stack().reset_index()
# stack() puts the index (temperature) first, then the column labels (salinity)
oxy2.columns = ['temperature','salinity','oxygenation']

Output is a 48 by 3 dataframe. Showing only first 5 rows:

#print(oxy2.head())
   temperature  salinity  oxygenation
0           -2     27.11        9.010
1           -2     28.91        8.890
2           -2     30.72        8.760
3           -2     32.52        8.640
4           -2     34.33        8.520

Then you can run the regression using the following code:

from sklearn import linear_model

regr = linear_model.LinearRegression()
regr.fit(oxy2[['salinity','temperature']], oxy2['oxygenation'])
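For reference, here is a minimal end-to-end sketch of the reshape-then-fit approach, using the data from the question. It names the axes with rename_axis before stacking instead of assigning column names afterwards, which is a variation rather than the exact code above. The query point (12 degrees, salinity 33) and the variable names `temps`, `new_point` and `predicted` are made up for illustration.

```python
import pandas as pd
from sklearn import linear_model

# Data from the question: index = temperature, columns = salinity.
temps = [-2, 0, 5, 10, 15, 20, 25, 30]
d = {27.11: [9.01, 8.555, 7.56, 6.77, 6.14, 5.63, 5.17, 4.74],
     28.91: [8.89, 8.43, 7.46, 6.69, 6.07, 5.56, 5.12, 4.68],
     30.72: [8.76, 8.32, 7.36, 6.60, 6.00, 5.50, 5.06, 4.69],
     32.52: [8.64, 8.20, 7.26, 6.52, 5.93, 5.44, 5.00, 4.58],
     34.33: [8.52, 8.08, 7.16, 6.44, 5.86, 5.38, 4.95, 4.52],
     36.11: [8.39, 7.97, 7.07, 6.35, 5.79, 5.31, 4.86, 4.46]}
oxy = pd.DataFrame(index=temps, data=d)

# Label both axes first, so the stacked result carries its own column names.
oxy2 = (oxy.rename_axis(index="temperature", columns="salinity")
           .stack()
           .rename("oxygenation")
           .reset_index())

X = oxy2[["temperature", "salinity"]]
y = oxy2["oxygenation"]
regr = linear_model.LinearRegression().fit(X, y)

print("coefficients:", dict(zip(X.columns, regr.coef_)))
print("intercept:", regr.intercept_)
print("R^2 on the training data:", regr.score(X, y))

# Hypothetical query point, chosen only to show how predict() is called.
new_point = pd.DataFrame({"temperature": [12.0], "salinity": [33.0]})
predicted = regr.predict(new_point)[0]
print("predicted oxygenation:", predicted)
```

The prediction input uses the same column names, in the same order, as the frame the model was fitted on.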

edited Feb 18, 2019 at 20:42

answered Feb 18, 2019 at 20:36


3 Comments

Enquandriant Over a year ago

Thank you so much. I've been messing around with pd.melt thinking it would help but this is much better!

James Phillips Over a year ago

I suggest visually inspecting scatterplots of temperature and salinity versus oxygenation to determine if there is any obvious data transform such as log or exp that might help with the linear regression - this is fast and easy to do.

James Phillips Over a year ago

I see from a 3D scatterplot that the data does not lie on a flat plane. When I add an interaction effect of "salinity * temperature" to the regression, the fit is improved.
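The interaction idea from the comment above could be sketched roughly as follows. This is one way to do it, not necessarily what the commenter ran: it adds a product column (named `temp_x_sal` here, a made-up name) and compares the in-sample R^2 of the two fits.

```python
import pandas as pd
from sklearn import linear_model

# Data from the question: index = temperature, columns = salinity.
temps = [-2, 0, 5, 10, 15, 20, 25, 30]
d = {27.11: [9.01, 8.555, 7.56, 6.77, 6.14, 5.63, 5.17, 4.74],
     28.91: [8.89, 8.43, 7.46, 6.69, 6.07, 5.56, 5.12, 4.68],
     30.72: [8.76, 8.32, 7.36, 6.60, 6.00, 5.50, 5.06, 4.69],
     32.52: [8.64, 8.20, 7.26, 6.52, 5.93, 5.44, 5.00, 4.58],
     34.33: [8.52, 8.08, 7.16, 6.44, 5.86, 5.38, 4.95, 4.52],
     36.11: [8.39, 7.97, 7.07, 6.35, 5.79, 5.31, 4.86, 4.46]}
oxy = pd.DataFrame(index=temps, data=d)

oxy2 = oxy.stack().reset_index()
oxy2.columns = ["temperature", "salinity", "oxygenation"]

# Interaction term: the product of the two predictors (hypothetical column name).
oxy2["temp_x_sal"] = oxy2["temperature"] * oxy2["salinity"]

y = oxy2["oxygenation"]
plain_cols = ["temperature", "salinity"]
inter_cols = ["temperature", "salinity", "temp_x_sal"]

plain = linear_model.LinearRegression().fit(oxy2[plain_cols], y)
inter = linear_model.LinearRegression().fit(oxy2[inter_cols], y)

r2_plain = plain.score(oxy2[plain_cols], y)
r2_inter = inter.score(oxy2[inter_cols], y)
print("R^2 without interaction:", r2_plain)
print("R^2 with interaction:   ", r2_inter)
```

An extra regressor can never lower the R^2 measured on the training data, so the size of the gain matters more than its sign. A check on held-out data would be a fairer comparison.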
