1

I have two large pandas dataframes (with millions of rows) that contain two columns, a group identifier and an id. I'm trying to create a combined dataframe that contains the group, id, plus a column with a 1 if the id was in the first dataframe, else a 0 and a column with a 1 if the id was in the second dataframe, else a 0.

In other words, I'm trying to merge the two dataframes and create conditional columns based on if the id was present for each original dataframe. Any suggestions on how to approach this problem?

Here is a small example:

import pandas as pd

>>> df_a = pd.DataFrame({'group': list('AAABBB'), 'id': [11,12,13,21,22,23]})
>>> df_b = pd.DataFrame({'group': list('AAABB'), 'id': [11,13,14,22,24]})
>>> df_a

 group  id
 A     11
 A     12
 A     13
 B     21
 B     22
 B     23

>>> df_b

group  id
A     11
A     13
A     14
B     22
B     24

The output should look like this:

>>> df_full      
 group  id  a  b
 A      11  1  1
 A      12  1  0
 A      13  1  1
 A      14  0  1
 B      21  1  0
 B      22  1  1
 B      23  1  0
 B      24  0  1

1 Answer 1

2

You can create two columns for each of the data frame with one before merging and fill na with zero after merging:

df_a['a'] = 1
df_b['b'] = 1

pd.merge(df_a, df_b, how = 'outer', on = ['group', 'id']).fillna(0)

# group   id      a   b
# 0   A 11.0    1.0 1.0
# 1   A 12.0    1.0 0.0
# 2   A 13.0    1.0 1.0
# 3   B 21.0    1.0 0.0
# 4   B 22.0    1.0 1.0
# 5   B 23.0    1.0 0.0
# 6   A 14.0    0.0 1.0
# 7   B 24.0    0.0 1.0
Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.