
Please forgive my ignorance. All I'm trying to do is add a squared term to my regression without going through the trouble of defining a new column in my dataframe. I'm using statsmodels.formula.api (as stats) because the format is similar to R, which I am more familiar with.

hours_model = stats.ols(formula='act_hours ~ h_hours + C(month) + trend', data = df).fit()

The above works as expected.

hours_model = stats.ols(formula='act_hours ~ h_hours + h_hours**2 + C(month) + trend', data = df).fit()

This omits h_hours**2 and returns the same output as the line above.

I've also tried h_hours^2, math.pow(h_hours, 2), and poly(h_hours, 2); all of them throw errors.

Any help would be appreciated.

  • What dtype are the h_hours values? If it's treated as categorical, then you need to convert to float or some other type that is treated by patsy as numeric. Commented May 19, 2020 at 19:55
  • Please take a look at sklearn.preprocessing.PolynomialFeatures; it will help. Commented May 19, 2020 at 20:08
  • @Josef, thank you for your response. The dtype for df['h_hours'] is float64. Commented May 20, 2020 at 13:15
  • @GIRISHkuniyal, thanks. I looked into it, but I don't think it fits for what I'm trying to do. I'm just looking for a squared term without any interaction. Commented May 20, 2020 at 13:18

3 Answers


You can try using I() like in R:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

np.random.seed(0)

# Toy data with the same column names as in the question
df = pd.DataFrame({'act_hours': np.random.uniform(1, 4, 100),
                   'h_hours': np.random.uniform(1, 4, 100),
                   'month': np.random.randint(0, 3, 100),
                   'trend': np.random.uniform(0, 2, 100)})

model = 'act_hours ~ h_hours + I(h_hours**2)'
hours_model = smf.ols(formula=model, data=df)

hours_model.exog[:5]  # design matrix columns: intercept, h_hours, h_hours**2

array([[ 1.        ,  3.03344961,  9.20181654],
       [ 1.        ,  1.81002392,  3.27618659],
       [ 1.        ,  3.20558207, 10.27575638],
       [ 1.        ,  3.88656564, 15.10539244],
       [ 1.        ,  1.74625943,  3.049422  ]])
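
In case it helps to see why the original formula silently dropped the term: Patsy parses ** inside a formula as term interaction up to the given degree, so h_hours**2 simplifies back to plain h_hours, while I() tells Patsy to evaluate the expression as ordinary Python arithmetic. A minimal follow-up sketch (using the toy df above and the column names from the question) then fits the full model the usual way:

# Squared term wrapped in I() so it survives formula parsing
hours_model = smf.ols(
    formula='act_hours ~ h_hours + I(h_hours**2) + C(month) + trend',
    data=df,
).fit()
print(hours_model.params)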

Currently, the statsmodels formula API (in fact, the underlying Patsy library) doesn't support a poly(variable, degree) function as in R, but NumPy's vander(variable, degree + 1) can do the job. Note, however, that np.vander() produces a Vandermonde matrix, which means you get an intercept column too. Let's see this function in an example:

>>> x = np.array([1, 2, 3, 5])
>>> np.vander(x, 4, increasing=True)

array([[  1,   1,   1,   1],
       [  1,   2,   4,   8],
       [  1,   3,   9,  27],
       [  1,   5,  25, 125]])

So, you need to remove Patsy's internal intercept by adding -1 to your formula:

hours_model = stats.ols(formula='act_hours ~ np.vander(h_hours, 3, increasing=True) - 1', data = df).fit()

Note that you need to pass your_desired_degree + 1 because the first column is x^0=1.
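
As a quick sanity check (a sketch, assuming the same df and the stats alias from the question, with numpy imported as np), you can confirm that -1 only removed Patsy's intercept and that the vander term supplies three columns, the first of which is the constant x^0 column:

model = stats.ols(formula='act_hours ~ np.vander(h_hours, 3, increasing=True) - 1', data=df)
print(model.exog_names)  # the three vander columns, indexed [0], [1], [2]
print(model.exog[:2])    # first column is all ones, then h_hours, then h_hours**2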


This is based on the answer by Mohammad-Reza Malekpour, but removes the intercept column from the Vandermonde matrix.

import numpy as np


def poly(x, degree):
    # Vandermonde matrix of x without its constant column, so Patsy keeps its own intercept
    return np.vander(x, degree + 1, increasing=True)[:, 1:]

Example:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

n = 30
np.random.seed(42)
x = np.random.uniform(0, 1, size=n)
y = 1 + 2 * x + 3 * x**2 + 4 * x**3 + np.random.normal(0, 0.001, size=n)
df = pd.DataFrame({"x": x, "y": y})

results = smf.ols(formula="y ~ poly(x, 3)", data=df).fit()
print(results.summary().tables[1])
=================================================================================
                    coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
Intercept         1.0009      0.001   1599.564      0.000       1.000       1.002
poly(x, 3)[0]     1.9935      0.006    361.673      0.000       1.982       2.005
poly(x, 3)[1]     3.0124      0.013    232.711      0.000       2.986       3.039
poly(x, 3)[2]     3.9919      0.009    463.711      0.000       3.974       4.010
=================================================================================

Compared with:

results = smf.ols(formula="y ~ x + I(x**2) + I(x**3)", data=df).fit()
print(results.summary().tables[1])
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      1.0009      0.001   1599.564      0.000       1.000       1.002
x              1.9935      0.006    361.673      0.000       1.982       2.005
I(x ** 2)      3.0124      0.013    232.711      0.000       2.986       3.039
I(x ** 3)      3.9919      0.009    463.711      0.000       3.974       4.010
==============================================================================

1 Comment

Your poly function is very helpful for writing formulas with many polynomial terms without having to spell out each term one by one. Thank you :-)
