1

Lets say I have two pandas DataFrames, df1 and df2, one containing the names and age of people, and the other detailing what they are studying. What is an efficient way to join the two, so that I have boolean fields of what each person is studying?

e.g. given the following

# df1
name  | age
------|----
John  | 24
Kelly | 49
Gemma | 18
Bob   | 29

# df2
name  | studies
------|----------
John  | education
John  | science
Kelly | science
Bob   | law
Bob   | commerce

How could I create the following dataframe with boolean values for each field of study?

name  | age | education | science | law   | commerce |
------|-----|-----------|---------|-------|----------|
John  | 24  | True      | True    | False | False    |
Kelly | 49  | False     | True    | False | False    |
Gemma | 18  | False     | False   | False | False    |
Bob   | 29  | False     | False   | True  | True     |

1 Answer 1

3

Use get_dummies with max, then join and replace missing values only for columns from df22:

s = df2.set_index('name')['studies']
df22 = pd.get_dummies(s, prefix_sep='', prefix='', dtype=bool).max(level=0)
df = df1.join(df22, on='name').fillna(dict.fromkeys(df22.columns, False))
print (df)
    name  age  commerce  education    law  science
0   John   24     False       True  False     True
1  Kelly   49     False      False  False     True
2  Gemma   18     False      False  False    False
3    Bob   29      True      False   True    False
Sign up to request clarification or add additional context in comments.

7 Comments

Amazing, thanks so much. I encountered the error TypeError: get_dummies() got an unexpected keyword argument 'dtype' but managed to get around it by using df22 = pd.get_dummies(s, prefix_sep='', prefix='').max(level=0).astype('bool') instead. Might just be an issue with my data because its not as clean as the example I gave.
@oskaboska - No, problem is your pandas version, dtype is implemented in pandas 0.23.0
Ah ok, that makes more sense. Thanks again
@jezrael - Tried df['column_name'] = df['column_name'].astype('bool') to get a boolean field in dataframe. With this, the boolean value is defaulted to True. How to default the boolean as False?
@user12379095 - Do you think df['column_name'] = ~(df['column_name'].astype('bool')) ?
|

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.