
Please forgive my ignorance. All I'm trying to do is add a squared term to my regression without going through the trouble of defining a new column in my dataframe. I'm using statsmodels.formula.api (as stats) because the format is similar to R, which I am more familiar with.

hours_model = stats.ols(formula='act_hours ~ h_hours + C(month) + trend', data = df).fit()

The above works as expected.

hours_model = stats.ols(formula='act_hours ~ h_hours + h_hours**2 + C(month) + trend', data = df).fit()

This omits h_hours**2 and returns the same output as the line above.

I've also tried h_hours^2, math.pow(h_hours, 2), and poly(h_hours, 2); all of them throw errors.

Any help would be appreciated.

  • What dtype are the h_hours values? If it's treated as categorical, then you need to convert to float or some other type that is treated by patsy as numeric. Commented May 19, 2020 at 19:55
  • Please take a look at sklearn.preprocessing.PolynomialFeatures; it will help. Commented May 19, 2020 at 20:08
  • @Josef, thank you for your response. The dtype for df['h_hours'] is float64. Commented May 20, 2020 at 13:15
  • @GIRISHkuniyal, thanks. I looked into it, but I don't think it fits for what I'm trying to do. I'm just looking for a squared term without any interaction. Commented May 20, 2020 at 13:18

3 Answers


You can try using I() like in R:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

np.random.seed(0)

# Toy data with the same column names as in the question
df = pd.DataFrame({'act_hours': np.random.uniform(1, 4, 100),
                   'h_hours': np.random.uniform(1, 4, 100),
                   'month': np.random.randint(0, 3, 100),
                   'trend': np.random.uniform(0, 2, 100)})

model = 'act_hours ~ h_hours + I(h_hours**2)'
hours_model = smf.ols(formula=model, data=df)

hours_model.exog[:5]  # design matrix columns: intercept, h_hours, h_hours**2

array([[ 1.        ,  3.03344961,  9.20181654],
       [ 1.        ,  1.81002392,  3.27618659],
       [ 1.        ,  3.20558207, 10.27575638],
       [ 1.        ,  3.88656564, 15.10539244],
       [ 1.        ,  1.74625943,  3.049422  ]])
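
In case it helps to see why the original formula silently dropped the term: Patsy parses ** inside a formula as term interaction up to the given degree, so h_hours**2 simplifies back to plain h_hours, while I() tells Patsy to evaluate the expression as ordinary Python arithmetic. A minimal follow-up sketch (using the toy df above and the column names from the question) then fits the full model the usual way:

# Squared term wrapped in I() so it survives formula parsing
hours_model = smf.ols(
    formula='act_hours ~ h_hours + I(h_hours**2) + C(month) + trend',
    data=df,
).fit()
print(hours_model.params)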

Currently, the statsmodels formula API (in fact, the underlying Patsy library) doesn't support a poly(variable, degree) function as in R, but NumPy's vander(variable, degree + 1) can do the job. Note, however, that np.vander() produces a Vandermonde matrix, which means you get an intercept column too. Let's see this function in an example:

>>> x = np.array([1, 2, 3, 5])
>>> np.vander(x, 4, increasing=True)

array([[  1,   1,   1,   1],
       [  1,   2,   4,   8],
       [  1,   3,   9,  27],
       [  1,   5,  25, 125]])

So, you need to remove Patsy's internal intercept by adding -1 to your formula:

hours_model = stats.ols(formula='act_hours ~ np.vander(h_hours, 3, increasing=True) - 1', data = df).fit()

Note that you need to pass your_desired_degree + 1 because the first column is x^0=1.
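
As a quick sanity check (a sketch, assuming the same df and the stats alias from the question, with numpy imported as np), you can confirm that -1 only removed Patsy's intercept and that the vander term supplies three columns, the first of which is the constant x^0 column:

model = stats.ols(formula='act_hours ~ np.vander(h_hours, 3, increasing=True) - 1', data=df)
print(model.exog_names)  # the three vander columns, indexed [0], [1], [2]
print(model.exog[:2])    # first column is all ones, then h_hours, then h_hours**2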


This is based on the answer by Mohammad-Reza Malekpour, but removes the intercept column from the Vandermonde matrix.

import numpy as np


def poly(x, degree):
    # Vandermonde matrix of x without its constant column, so Patsy keeps its own intercept
    return np.vander(x, degree + 1, increasing=True)[:, 1:]

Example:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

n = 30
np.random.seed(42)
x = np.random.uniform(0, 1, size=n)
y = 1 + 2 * x + 3 * x**2 + 4 * x**3 + np.random.normal(0, 0.001, size=n)
df = pd.DataFrame({"x": x, "y": y})

results = smf.ols(formula="y ~ poly(x, 3)", data=df).fit()
print(results.summary().tables[1])
=================================================================================
                    coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
Intercept         1.0009      0.001   1599.564      0.000       1.000       1.002
poly(x, 3)[0]     1.9935      0.006    361.673      0.000       1.982       2.005
poly(x, 3)[1]     3.0124      0.013    232.711      0.000       2.986       3.039
poly(x, 3)[2]     3.9919      0.009    463.711      0.000       3.974       4.010
=================================================================================

Compared with:

results = smf.ols(formula="y ~ x + I(x**2) + I(x**3)", data=df).fit()
print(results.summary().tables[1])
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      1.0009      0.001   1599.564      0.000       1.000       1.002
x              1.9935      0.006    361.673      0.000       1.982       2.005
I(x ** 2)      3.0124      0.013    232.711      0.000       2.986       3.039
I(x ** 3)      3.9919      0.009    463.711      0.000       3.974       4.010
==============================================================================

1 Comment

Your poly function is very helpful for writing formulas with many polynomial terms without having to spell out each term one by one. Thank you :-)
