Problem Running OLS with Stata Data in Python

Question

I am having problems running OLS in Python after reading in Stata data. Below are my code and the error message:

import pandas as pd  # To read data
import numpy as np
import statsmodels.api as sm

gss = pd.read_stata("gssSample.dta", preserve_dtypes=False)
X = gss[['age', 'impinc' ]]
y = gss[['educ']]
X = sm.add_constant(X) # adding a constant
model = sm.OLS(y, X).fit()
print(model.summary())

The error message says:

ValueError: Pandas data cast to numpy dtype of object. Check input data with np.asarray(data).

Any thoughts on how to run this simple OLS?

Certainly. You can download the data file from the link and reproduce the results using my code above.
– WaterWood, Commented Aug 31, 2020 at 14:24
Please post the data in the body of the question to avoid dead or forbidden links for current and future readers.
– Parfait, Commented Aug 31, 2020 at 20:31
So no link in the comment, but a link in the body of my question?
– WaterWood, Commented Aug 31, 2020 at 23:01
Yes, this is what @Parfait is asking. It is important to pick a set of observations that reproduces your problem. The community has produced some guidance for this process here.
– Arthur Morris, Commented Sep 1, 2020 at 5:51

Wouter · Accepted Answer · 2020-09-01 07:08:36Z

4

Your age variable contains a value "89 or older" which is causing it to be read as a string, which is not a valid input for statsmodels. You have to deal with this so it can be read as integer or float, for example like this:

gss = pd.read_stata("gssSample.dta", preserve_dtypes=False)
gss = gss[gss.age != '89 or older']
gss['age'] = gss.age.astype(float)
X = gss[['age', 'impinc' ]]
y = gss[['educ']]
X = sm.add_constant(X) # adding a constant
model = sm.OLS(y, X).fit()
print(model.summary())

P.S. I'm not saying that dropping observations where age == "89 or older" is the best way. You'll have to decide how best to deal with this. If you want to have a categorical variable in your model you'll have to create dummies first.
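As a rough sketch of the "create dummies first" step, the snippet below uses pandas.get_dummies on a made-up categorical column. The DataFrame and the region column are hypothetical and are not part of gssSample.dta; only the pattern is meant to carry over.

```python
import pandas as pd

# Hypothetical data: "region" is an invented categorical regressor.
df = pd.DataFrame({
    "age": [23.0, 45.0, 67.0, 31.0],
    "region": ["north", "south", "west", "south"],
})

# One 0/1 column per category; drop_first=True omits the first category
# ("north") so the dummies are not collinear with the constant term.
dummies = pd.get_dummies(df["region"], prefix="region", drop_first=True, dtype=float)

# Numeric design matrix that could then be passed to sm.add_constant / sm.OLS.
X = pd.concat([df[["age"]], dummies], axis=1)
print(X.columns.tolist())  # ['age', 'region_south', 'region_west']
```

Passing dtype=float keeps every column numeric, which avoids the same object-dtype error when the matrix reaches statsmodels.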

EDIT: If your .dta file contains a numeric variable with value labels, pd.read_stata uses the value labels as the values by default, so the labelled entries come through as strings. You can pass convert_categoricals=False to pd.read_stata to read in the underlying numeric values instead.
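The effect of convert_categoricals can be checked on a small stand-in file. This is only a sketch: the toy data below is invented, and it assumes pandas 1.4 or later, where DataFrame.to_stata accepts a value_labels argument, to attach the label "89 or older" to the value 89.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for gssSample.dta: age is numeric, and only the
# top-coded value 89 carries the value label "89 or older".
toy = pd.DataFrame({"age": [23, 45, 89], "educ": [12, 16, 10]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.dta")
    toy.to_stata(path, write_index=False,
                 value_labels={"age": {89: "89 or older"}})

    # Default: value labels replace the underlying numbers where they exist.
    labelled = pd.read_stata(path)
    # convert_categoricals=False: keep the stored numeric codes.
    numeric = pd.read_stata(path, convert_categoricals=False)

print(labelled["age"].dtype)    # a categorical column, not numeric
print(numeric["age"].tolist())  # [23, 45, 89]
```

With the numeric version, age can go straight into the design matrix without dropping or recoding any rows.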

edited Sep 1, 2020 at 7:08

answered Aug 31, 2020 at 19:50


3 Comments

WaterWood Over a year ago

Thanks a lot for your help here. I am kind of confused. The issue is that '89 or older' is coded as 89 in Stata; '89 or older' is only the value label. So after I use read_stata to read the Stata data into Python, a value of 89 with that label becomes a string? Is there any way to read in the age variable as numeric (e.g., an option in read_stata)? Thanks a lot!

Wouter Over a year ago

If a .dta file contains a variable with value labels, pandas.read_stata takes the value labels as the values for the DataFrame by default, see link. You can add the convert_categoricals=False option to read in the numeric values, which in this case actually appears to be a better solution. I'll add this to my answer.

WaterWood Over a year ago

Adding convert_categoricals=False made it work! Fantastic! Thanks a lot!

Arthur Morris · Answer · 2020-09-01 01:52:35Z

0

An alternative second line of @Wouter's solution could be:

gss.loc[gss.age == '89 or older', 'age'] = '89'

See this discussion of replacing based on a condition for more details.

Of course, whether this replacement is appropriate depends on your use case.
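A minimal sketch of this replace-then-convert pattern, assuming age arrived as a plain string column (the toy DataFrame is invented; a Categorical column, which read_stata can also produce, may need its categories adjusted first):

```python
import pandas as pd

# Hypothetical data mimicking an age column that was read in as strings.
df = pd.DataFrame({"age": ["23", "45", "89 or older"], "educ": [12, 16, 10]})

# Recode the top-coded label to "89" instead of dropping the row.
df.loc[df["age"] == "89 or older", "age"] = "89"

# Every entry is now a numeric string, so the cast to float succeeds.
df["age"] = df["age"].astype(float)
print(df["age"].tolist())  # [23.0, 45.0, 89.0]
```

Compared with filtering the rows out, this keeps all observations but treats everyone aged 89 or older as exactly 89.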

answered Sep 1, 2020 at 1:52


1 Comment

Arthur Morris Over a year ago

Note that Wouter's answer completely addresses the question you asked in the original post. I'd encourage you to mark it as the accepted answer.
