
I have a dataset that contains multiple columns of binary values.

import pandas as pd

df = pd.DataFrame({"a": ["y", "n"], "b": ["t", "f"],
                   "c": ["known", "unknown"], "d": ["found", "not found"]})

I want to convert all the binary columns to 1/0 without affecting the other, numeric columns. Is there a simple solution in one or two lines? The dataset contains over 500 columns, so checking and replacing them one by one is impractical. Thanks.

  • Welcome to SO. Please review How to Ask, and create a minimal reproducible example. That means no broken sample code for others to test. Your current sample code is not valid Python, so it will be difficult to help. Commented Jul 29, 2019 at 17:01
  • astype('category')? Commented Jul 29, 2019 at 17:02
  • If these are just binary and you don't particularly care which value becomes 1, try: pd.get_dummies(df).iloc[:, ::2]. Otherwise please provide a more complete example and explanation of what you need. Commented Jul 29, 2019 at 17:09
  • Or: df.assign(**df.select_dtypes(object).apply(lambda c: c.factorize()[0])) Commented Jul 29, 2019 at 17:10
  • But as for "the 500 other columns", we need a few more constraints. Is every object column guaranteed to be a binary column you need to transform? If not, I think you'll at least need some pattern or a list of the specific columns to transform. Or perhaps we can try with nunique == 2? Commented Jul 29, 2019 at 17:12

1 Answer

You can use pd.get_dummies with drop_first=True (credit to @piRSquared):

# dtype=int keeps the 1/0 output; pandas >= 2.0 returns booleans by default
pd.get_dummies(df, drop_first=True, dtype=int)

#   a_y  b_t  c_unknown  d_not found
#0    1    1          0            0
#1    0    0          1            1

If this should be done only for binary object columns, subset first:

df = pd.DataFrame({'a': ['y', 'n', 'c'], 
                   'b': ['t', 'f', 't'], 
                   'c': ['known', 'unknown', 'known'],
                   'd': ['found', 'not found', 'found'],
                   'e': [1, 2, 2]})

pd.get_dummies(df.loc[:, df.nunique() == 2].select_dtypes(include='object'),
               drop_first=True, dtype=int)

#   b_t  c_unknown  d_not found
#0    1          0            0
#1    0          1            1
#2    1          0            0
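The subset above returns only the encoded columns, under new names. If the goal is to keep the original column names and leave every other column in place, one possible sketch is to factorize each binary column and write the codes back. Selecting candidates with select_dtypes(exclude='number') is an assumption made here; the answer filters on object dtype instead. The names candidates, binary_cols and encoded are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'a': ['y', 'n', 'c'],
                   'b': ['t', 'f', 't'],
                   'c': ['known', 'unknown', 'known'],
                   'd': ['found', 'not found', 'found'],
                   'e': [1, 2, 2]})

# Non-numeric columns with exactly two distinct values
# (assumption: "non-numeric" stands in for the object-dtype filter used above)
candidates = df.select_dtypes(exclude='number')
binary_cols = [c for c in candidates.columns if candidates[c].nunique() == 2]

encoded = df.copy()
for col in binary_cols:
    # sort=True numbers the values in sorted order: first -> 0, second -> 1,
    # the same direction as get_dummies(drop_first=True) for a binary column.
    # Missing values, if any, would receive the code -1.
    codes, _ = pd.factorize(df[col], sort=True)
    encoded[col] = codes

print(encoded)
```

Which value becomes 1 is still decided by sort order here, not by meaning, so this only helps when the direction of the encoding does not matter.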

If there are a small number of binary responses across columns, consider creating a dictionary and mapping the values:

d = {'y': 1, 'n': 0,
     't': 1, 'f': 0,
     'known': 1, 'unknown': 0,
     'found': 1, 'not found': 0}

s = (df.agg('nunique') == 2) & (df.dtypes == 'object')
for col in s[s].index:
    df[col] = df[col].map(d)

#   a  b  c  d  e
#0  y  1  1  1  1
#1  n  0  0  0  2
#2  c  1  1  1  2
#   |
#  `a` not mapped because trinary
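One thing to watch with the dictionary approach: Series.map with a plain dict produces NaN for any value that is not a key. With 500 columns it may be worth checking each column against the dictionary before mapping. The following is a minimal sketch of that check, not part of the original answer; the sample frame (with a stray 'maybe' value) and the skipped list are illustrative assumptions.

```python
import pandas as pd

# Illustrative data: column 'c' contains a value the dictionary does not cover
df = pd.DataFrame({'b': ['t', 'f', 't'],
                   'c': ['known', 'unknown', 'maybe'],
                   'e': [1, 2, 2]})

d = {'y': 1, 'n': 0,
     't': 1, 'f': 0,
     'known': 1, 'unknown': 0,
     'found': 1, 'not found': 0}

skipped = []  # columns left untouched because of unmapped values
for col in df.select_dtypes(exclude='number').columns:
    values = set(df[col].dropna().unique())
    if values.issubset(d):
        # every observed value has an entry in d, so map() cannot introduce NaN
        df[col] = df[col].map(d)
    else:
        skipped.append(col)

print(df)
print("not mapped:", skipped)
```

Columns that end up in skipped can then be inspected by hand or handled by extending the dictionary.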

2 Comments

Thanks, but how can I be sure that get_dummies assigns 1 to 'T', 'known', 'y', 'found', and 0 otherwise? And what if I don't want to change the column names?
@SHendricks when the data are messy there's not really an easy one liner to deal with it. You're going to need to specify the mapping so that we know "known = 1" as opposed to the opposite. I think any natural language processing to determine that is probably absolute overkill for something like this, which you can hard-code with much less time investment. If all 500 columns have 500 different binary responses you're just going to have to bite the bullet and code it how you want.
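To illustrate what "specify the mapping" could look like while keeping the column names, here is a hedged sketch using the question's frame plus a numeric column e added for illustration. The set positive is a hypothetical, hand-written list of the values that should become 1; nothing in pandas infers it.

```python
import pandas as pd

df = pd.DataFrame({"a": ["y", "n"], "b": ["t", "f"],
                   "c": ["known", "unknown"], "d": ["found", "not found"],
                   "e": [3, 4]})  # 'e' is an added numeric column for illustration

# Hypothetical, hand-specified set of values that should be encoded as 1
positive = {"y", "t", "known", "found"}

for col in df.select_dtypes(exclude="number").columns:
    values = set(df[col].dropna().unique())
    # Encode only columns with two distinct values, exactly one of them "positive"
    if len(values) == 2 and len(values & positive) == 1:
        # isin() yields True for positive values; missing values would become 0
        df[col] = df[col].isin(positive).astype(int)

print(df)
```

Column names are unchanged, numeric columns are skipped, and any column whose values do not fit the pattern is left as it was.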
