I'm trying to replace strings with integers in a pandas DataFrame. I've already looked here, but the solution doesn't work for me.

Reprex:

import pandas as pd 
pd.__version__

> '1.4.1'


test = pd.DataFrame(data = {'a': [None, 'Y', 'N', '']}, dtype = 'string')
test.replace(to_replace = 'Y', value = 1)

> ValueError: Cannot set non-string value '1' into a StringArray.

I know that I could do this individually for each column, either explicitly or using apply, but I am trying to avoid that. I'd ideally replace all 'Y' in the dataframe with int(1), all 'N' with int(0) and all '' with None or pd.NA, so the replace function appears to be the fastest/clearest way to do this.

  • You can change the column from string type to object type, which will allow you to set mixed datatypes in that column. Commented Jun 28, 2022 at 22:01
  • The issue is that not all columns in my actual data need conversion. I have over 300 columns, but only a subset has Y/N/'' values. Would your approach require converting all columns to object? Or would I have to explicitly hardcode which columns to convert to `object`? Ideally I'd convert only the columns that need converting, without hardcoding. Commented Jun 28, 2022 at 22:07
  • It's a pragmatic rather than performant solution, but is there a reason you need your other columns to be single-typed? Could you just convert the entire DF to object? Otherwise, if you're staying single-typed, would '1' work as well as 1 for whatever operations you need to do next? Commented Jun 28, 2022 at 22:29
  • Or just the string columns: for i in test.select_dtypes('string').columns: test[i] = test[i].astype(object) Commented Jun 28, 2022 at 22:46
  • Yeah, you're right. I ended up just converting the whole thing to object, then using the DataFrame.convert_dtypes() method to back-convert, and it pretty much takes care of everything. Thanks! Commented Jun 28, 2022 at 22:49
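
For reference, the workflow the comments converge on (cast to object, replace, then back-convert with convert_dtypes) might look like the sketch below. The helper name `replace_yn` and the extra column `n` are made up for illustration, and the `isinstance(..., pd.StringDtype)` check is just one way to pick out the string columns; the comment above uses `select_dtypes('string')` instead.

```python
import pandas as pd


def replace_yn(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: map 'Y'/'N'/'' to 1/0/<NA> across a frame."""
    # Only columns with the pandas string extension dtype need the detour
    # through object; numeric columns are left as they are.
    str_cols = [c for c in df.columns if isinstance(df[c].dtype, pd.StringDtype)]

    # object columns can hold a mix of str, int and pd.NA, so replace()
    # no longer fails with the StringArray ValueError from the question.
    out = df.astype({c: object for c in str_cols})
    out = out.replace({'Y': 1, 'N': 0, '': pd.NA})

    # Let pandas infer the best nullable dtype for each column again.
    return out.convert_dtypes()


test = pd.DataFrame({
    'a': pd.array([None, 'Y', 'N', ''], dtype='string'),
    'n': [10, 20, 30, 40],  # illustrative column that needs no conversion
})
result = replace_yn(test)
print(result)
print(result.dtypes)
```

Note that the replacement is applied to the whole frame, so a free-text column that happens to contain a bare 'Y' or 'N' would be changed too; whether that is acceptable depends on the data.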

1 Answer

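Use Int8Dtype. The nullable IntXXDtype extension types allow both integer values and <NA>. Replace with the string forms '1' and '0' first, so the column stays a valid string column, then cast: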

test['b'] = test['a'].replace({'Y': '1', 'N': '0', '': pd.NA}).astype(pd.Int8Dtype())
print(test)

# Output
      a     b
0  <NA>  <NA>
1     Y     1
2     N     0
3        <NA>
>>> [type(x) for x in test['b']]
[pandas._libs.missing.NAType,
 numpy.int8,
 numpy.int8,
 pandas._libs.missing.NAType]
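
To apply the same idea across many columns without hardcoding their names, one option is to detect the columns whose values are limited to 'Y', 'N' and '' and convert only those. This is a sketch under assumptions, not part of the original answer: `YN_VALUES`, `yn_columns`, `convert_yn` and the `name` column are hypothetical names introduced here.

```python
import pandas as pd

# Assumption: these are the only values that mark a column as a Y/N flag.
YN_VALUES = {'Y', 'N', ''}


def yn_columns(df: pd.DataFrame) -> list:
    """Return string-dtype columns whose non-missing values are all Y/N/''."""
    cols = []
    for c in df.columns:
        if isinstance(df[c].dtype, pd.StringDtype):
            # An all-missing column also passes this check (empty set).
            if set(df[c].dropna().unique()) <= YN_VALUES:
                cols.append(c)
    return cols


def convert_yn(df: pd.DataFrame) -> pd.DataFrame:
    """Convert the detected columns to nullable Int8, as in the answer."""
    out = df.copy()
    for col in yn_columns(out):
        out[col] = (
            out[col]
            .replace({'Y': '1', 'N': '0', '': pd.NA})  # still strings here
            .astype(pd.Int8Dtype())                    # '1'/'0' -> 1/0
        )
    return out


test = pd.DataFrame({
    'a': pd.array([None, 'Y', 'N', ''], dtype='string'),
    'name': pd.array(['x', 'Y', 'z', 'w'], dtype='string'),  # free text
})
result = convert_yn(test)
print(yn_columns(test))
print(result.dtypes)
```

Because `name` contains values outside the Y/N set, it is left untouched even though one of its entries is 'Y'.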