
I am having problems running OLS in Python after reading in Stata data. Below are my code and the error message.

import pandas as pd  # To read data
import numpy as np
import statsmodels.api as sm

gss = pd.read_stata("gssSample.dta", preserve_dtypes=False)
X = gss[['age', 'impinc' ]]
y = gss[['educ']]
X = sm.add_constant(X) # adding a constant
model = sm.OLS(y, X).fit()
print(model.summary())

The error message says:

ValueError: Pandas data cast to numpy dtype of object. Check input data with np.asarray(data).

Any thoughts on how to run this simple OLS?

  • Can you share a small example of the data? Commented Aug 31, 2020 at 3:36
  • Certainly. You can download the data file from the link and reproduce the results using my code above. Commented Aug 31, 2020 at 14:24
  • Please post the data in the body of the question to avoid dead or forbidden links for current and future readers. Commented Aug 31, 2020 at 20:31
  • So no link in the comment, but a link in the body of my question? Commented Aug 31, 2020 at 23:01
  • Yes, this is what @Parfait is asking. It is important to pick a set of observations that reproduces your problem. The community has produced some guidance for this process here. Commented Sep 1, 2020 at 5:51

2 Answers


Your age variable contains the value "89 or older", which causes the column to be read as a string, and that is not a valid input for statsmodels. You have to deal with this value so the column can be read as integer or float, for example like this:

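import pandas as pd
import statsmodels.api as sm

gss = pd.read_stata("gssSample.dta", preserve_dtypes=False)
# Drop the rows labelled "89 or older"; .copy() avoids a SettingWithCopyWarning below
gss = gss[gss.age != '89 or older'].copy()
gss['age'] = gss.age.astype(float)  # the remaining ages convert to float
X = gss[['age', 'impinc']]
y = gss[['educ']]
X = sm.add_constant(X)  # adding a constant
model = sm.OLS(y, X).fit()
print(model.summary())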

P.S. I'm not saying that dropping observations where age == "89 or older" is the best way. You'll have to decide how best to deal with this. If you want to have a categorical variable in your model you'll have to create dummies first.

EDIT: If your .dta file contains a numeric value with value labels, the value labels will be used as values by default causing it to be read as string. You can use convert_categoricals=False with pd.read_stata to read in the numeric values.
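
To see the effect of convert_categoricals without the original file, here is a minimal sketch on a toy dataset. The file name toy.dta, the column values, and the variable names labeled and numeric are made up for illustration, and the value_labels argument of to_stata assumes pandas 1.4 or newer.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for the real file: age 89 carries the value label "89 or older".
df = pd.DataFrame({"age": [23, 45, 89, 89], "educ": [12, 16, 10, 14]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.dta")  # hypothetical file name
    # Attach a Stata value label to age == 89 (needs pandas >= 1.4).
    df.to_stata(path, write_index=False, value_labels={"age": {89: "89 or older"}})

    # Default behaviour: value labels replace the underlying numbers.
    labeled = pd.read_stata(path)
    # With convert_categoricals=False the raw numeric values are kept.
    numeric = pd.read_stata(path, convert_categoricals=False)

print(labeled["age"].dtype)     # no longer a plain numeric dtype
print(numeric["age"].tolist())  # the stored numbers, usable by statsmodels
```

On the real data, the same option would go into the original call, i.e. pd.read_stata("gssSample.dta", convert_categoricals=False), after which no rows need to be dropped.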


3 Comments

Thanks a lot for your help here. I am kind of confused. The issue is that '89 or older' is coded as 89 in Stata; "89 or older" is the value label. So after I use read_stata to read the Stata data into Python, 89 with that label becomes a string? Is there any way to read in the age variable as numeric (e.g., an option in read_stata)? Thanks a lot!
If a .dta file contains a variable with value labels, pandas.read_stata takes the value labels as the values for the DataFrame by default, see link. You can add the convert_categoricals=False option to read in the numeric values, which in this case actually appears to be a better solution. I'll add this to my answer.
Adding convert_categoricals=False made it work! Fantastic! Thanks a lot!

An alternative second line of @Wouter's solution could be:

gss.loc[gss.age == '89 or older', 'age'] = '89'

See this discussion of replacing based on a condition for more details.

Of course, whether this replacement is appropriate depends on your use case.
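
One caveat, as a hedged sketch rather than a tested fix for the original file: if read_stata returned age as a categorical column (which the value-label discussion in the other answer suggests), pandas will generally refuse to assign a value that is not already one of the categories. Going through strings first sidesteps that. The toy Series below is an assumption standing in for gss.age.

```python
import pandas as pd

# Toy column mimicking a partly labelled Stata variable after read_stata.
age = pd.Series(pd.Categorical([23, 45, "89 or older", "89 or older"]))

# Convert to strings, swap the label for "89", then parse back to numbers.
recoded = pd.to_numeric(age.astype(str).replace("89 or older", "89"))

print(recoded.tolist())
```

This top-codes everyone aged 89 or older at 89 instead of dropping them, so the same caution about suitability applies.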

1 Comment

Note that Wouter's answer completely addresses the question you asked in the original post. I'd encourage you to mark it as the accepted answer.
