0

I am working with a medical database, and I have a dataframe that looks like this:

      INC_KEY   COMORBID1    COMORBID2    COMORBID3  COMORBID4
0   160389417   Hypertension None         None       None
1   160789043   COPD         Hypertension Diabetes   None
2   160039662   Hypertension ADLC         Other      None
3   160367584   Diabetes     None         None       None
4   160008818   None         None         None       None

As you can see, there are multiple columns for comorbidities that are labeled numerically, each of which can have several different values.

I need to make it such that the column names are the comorbidities, and the values are either a 0 for no and a 1 for yes.

Example:

      INC_KEY  HYPERTENSION COPD ADLC DIABETES
 0  160389417             1    0    0        0
 1  160789043             1    1    0        1
 2  160039662             1    0    1        0
 3  160367584             0    0    0        1
 4  160008818             0    0    0        0

I've given a simplified version, but there are 24 different possible comorbidities with which I need to do this.

I have tried pd.get_dummies(), however it does not work the way I need it to. The get_dummies function creates individual columns for each unique value for EACH COMORBID1-COMORBID24. So instead of 24 new columns, I end up with 24*24=576 new columns.

So with get_dummies the new column names would be:

COMORBID1_HYPERTENSION, COMORBID1_COPD, COMORBID1_ADLC, COMORBID1_DIABETES, COMORBID2_HYPERTENSION, COMOBID2_COPD, COMORBID2_ADLC, COMORBID2_DIABETES...

and so on all the way through 24.

What is the best way to do what I am trying to do?

Thank you in advance to anyone who helps :)

1 Answer 1

2

First convert INC_KEY to index and remove only NaNs columns by DataFrame.dropna, then use get_dummies with both parameters and last aggregate max for only 0,1 values:

df = df.set_index('INC_KEY').dropna(axis=1, how='all')
df = pd.get_dummies(df, prefix='', prefix_sep='').groupby(level=0, axis=1).max()
print (df)
           ADLC  COPD  Diabetes  Hypertension  Other
INC_KEY                                             
160389417     0     0         0             1      0
160789043     0     1         1             1      0
160039662     1     0         0             1      1
160367584     0     0         1             0      0
160008818     0     0         0             0      0
Sign up to request clarification or add additional context in comments.

3 Comments

do you mind explaining, why do I need to convert INC_KEY to index?
@SeanRoudnitsky- Sure, because INC_KEY cannot be processing by get_dummies, so I is 'stored' to index
ah I see. How can I restore INC_KEY as the first column once I've finished with get_dummies()?

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.