I just switched to Python from R, and now I'm trying to read in data from a CSV file. I was annoyed to find all my integer columns being treated as floats, and after some digging I found that this is the cause: NumPy or Pandas: Keeping array type as integer while having a NaN value

The accepted answer there gives me a hint as to where to go, but my data has hundreds of columns, as is typical in data science, I suppose. So I don't want to specify a type for every column when reading the data with read_csv. R handles this automatically.

Is it really this hard to use pandas to read in data in a proper way in Python?

Source: https://pandas.pydata.org/pandas-docs/version/0.24/whatsnew/v0.24.0.html#optional-integer-na-support

  • @sammywemmy: how do you suggest I go about sharing sample data? Do you want me to share a csv file? Commented Apr 6, 2021 at 9:46
  • @Rafaelars: As I'm saying in my question, I do not want to explicitly state the types of all columns. And no, all columns are not integers. Commented Apr 6, 2021 at 9:47
  • @sammywemmy I mean, it is well known that R is able to do this, and that Python is not, so why must I show this? Read this for context: pandas.pydata.org/pandas-docs/version/0.24/development/… Commented Apr 6, 2021 at 9:50
  • and/or see: pandas.pydata.org/pandas-docs/version/0.24/whatsnew/… Commented Apr 6, 2021 at 9:54

1 Answer


You can try using:

df = pd.read_csv('./file.csv', dtype='Int64')

Edit: That doesn't work when the file contains string columns. Instead, read the file normally and then try something like this:

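for col in df.columns[df.isna().any()]:
    # Only float columns whose non-missing values are all whole numbers
    # can be cast to Int64 without raising a TypeError.
    if df[col].dtype == 'float64' and (df[col].dropna() % 1 == 0).all():
        df[col] = df[col].astype('Int64')

This loops through each column that contains an NA value, checks that it is a float column holding only whole numbers, and converts it to Int64. Float columns with fractional values are left as they are, since casting them to Int64 fails.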
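
To avoid repeating that loop for every file, it could be wrapped in a small helper. The following is only a sketch: `read_csv_nullable_ints` is a hypothetical name (not a pandas function), the inline CSV text is made-up sample data, and the choice to skip all-NA columns is my own assumption rather than anything from the question.

```python
import io

import pandas as pd


def read_csv_nullable_ints(path_or_buffer, **kwargs):
    """Read a CSV, then cast whole-number float columns containing NAs to Int64.

    Hypothetical helper for illustration; not part of pandas.
    """
    df = pd.read_csv(path_or_buffer, **kwargs)
    # Only columns with at least one NA can have been upcast from int to float.
    for col in df.columns[df.isna().any()]:
        if df[col].dtype != "float64":
            continue
        values = df[col].dropna()
        # Skip all-NA columns and columns holding fractional or infinite values.
        if len(values) > 0 and (values % 1 == 0).all():
            df[col] = df[col].astype("Int64")
    return df


# Made-up sample data: "count" is integer-like with a gap, "price" is truly float.
csv_text = "id,count,price,name\n1,10,1.5,a\n2,,2.0,b\n3,30,,c\n"
df = read_csv_nullable_ints(io.StringIO(csv_text))
print(df.dtypes)
```

Here `count` should come out as Int64, while `price` stays float64 and `name` stays object, so no per-column dtype has to be listed by hand.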

4 Comments

that will not work when the data has string types
Thank you for the contribution! That will work, I'm just seeking something that will fix this automatically for me. I'm surprised Python can't handle this in a nice way.
Happy to help! I don't think it's a problem with Python per se but more so a quirk of the Pandas library. You could possibly create your own read_csv function incorporating the above code, possibly as a module so that you can easily import it in the future?
I tried this and got TypeError: cannot safely cast non-equivalent float64 to int64, although I used Int64 as in the code above. My Pandas version is '0.24.2'.
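
The TypeError in the last comment is what I would expect when a float column with an NA also holds a fractional value, although I can't confirm that this was the cause in the commenter's data. The sketch below reproduces the error on made-up data and then shows `DataFrame.convert_dtypes()`, which is only available from pandas 1.0 onwards (so not in the 0.24.2 mentioned above) and picks nullable dtypes per column automatically.

```python
import numpy as np
import pandas as pd

# Made-up data: one float column with whole numbers, one with a fractional value.
df = pd.DataFrame({
    "whole": [1.0, np.nan, 3.0],
    "fractional": [1.5, np.nan, 3.0],
})

# Casting a column that contains 1.5 to Int64 cannot be done losslessly.
try:
    df["fractional"].astype("Int64")
    cast_failed = False
except TypeError as exc:
    cast_failed = True
    print("astype failed:", exc)

# Assumes pandas >= 1.0: infer the best nullable dtype for every column.
converted = df.convert_dtypes()
print(converted.dtypes)
```

With this, `whole` should become Int64 while `fractional` stays a float column, which is close to the automatic behaviour the question asks for.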
