My sample data is below:

import pandas as pd

data1 = {'index': ['001', '001', '001', '002', '002', '003', '004', '004'],
         'type': ['red', 'red', 'red', 'yellow', 'red', 'green', 'blue', 'blue'],
         'class': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A']}
df1 = pd.DataFrame(data1, columns=['index', 'type', 'class'])
df1
    index   type    class
0   001     red     A
1   001     red     A
2   001     red     A
3   002     yellow  A
4   002     red     A
5   003     green   A
6   004     blue    A
7   004     blue    A

data2 = {'index': ['001', '001', '002', '003', '004'],
         'type': ['red', 'red', 'yellow', 'green', 'blue'],
         'class': ['A', 'A', 'A', 'B', 'A'],
         'outcome': ['in', 'in', 'out', 'in', 'out']}
df2 = pd.DataFrame(data2, columns=['index', 'type', 'class', 'outcome'])
df2
    index   type    class   outcome
0   001     red     A       in
1   001     red     A       in
2   002     yellow  A       out
3   003     green   B       in
4   004     blue    A       out

In df1, class is always A; in df2 it can be A, B or C. I want to add the rows from df1 that are missing in df2. df1 holds the counts of types for each index: for example, if index 001 appears 3 times in df1, it should also appear 3 times in df2. For rows that come from df1 and are not in df2, the outcome column should be NaN. The output should be:

    index   type    class   outcome
0   001     red     A       in
1   001     red     A       in
2   001     red     A       NaN
3   002     yellow  A       out
4   002     red     A       NaN
5   003     green   A       NaN
6   003     green   B       in
7   004     blue    A       out
8   004     blue    A       NaN

I tried pd.concat and pd.merge, but I kept getting duplicates or wrong rows added. Does anyone have an idea of how to do this?

2 Answers

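Use GroupBy.cumcount to number the repeated rows within each ['index', 'type', 'class'] group, which makes every row's key unique. An outer join with DataFrame.merge on those columns plus the counter then pairs the rows one-to-one: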

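# Counter that numbers repeated (index, type, class) rows 0, 1, 2, ...
df1['group'] = df1.groupby(['index','type','class']).cumcount()
df2['group'] = df2.groupby(['index','type','class']).cumcount()

# Outer join on the keys plus the counter, then drop the helper column
df = (df1.merge(df2, on=['index','type','class','group'], how='outer')
         .sort_values(by=['index', 'class'])
         .drop(columns='group'))
print(df)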
  index    type class outcome
0   001     red     A      in
1   001     red     A      in
2   001     red     A     NaN
3   002  yellow     A     out
4   002     red     A     NaN
5   003   green     A     NaN
8   003   green     B      in
6   004    blue     A     out
7   004    blue     A     NaN
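
To see why the counter matters, here is a minimal self-contained sketch that compares a plain outer merge with the counter-based one on the question's data. The helper name `merge_with_counter` and the temporary column `_n` are made up for this illustration and are not part of pandas or of the answer above. The sketch only checks row counts, since the row order of an outer merge can differ between pandas versions.

```python
import pandas as pd

keys = ['index', 'type', 'class']

df1 = pd.DataFrame({
    'index': ['001', '001', '001', '002', '002', '003', '004', '004'],
    'type': ['red', 'red', 'red', 'yellow', 'red', 'green', 'blue', 'blue'],
    'class': ['A'] * 8,
})
df2 = pd.DataFrame({
    'index': ['001', '001', '002', '003', '004'],
    'type': ['red', 'red', 'yellow', 'green', 'blue'],
    'class': ['A', 'A', 'A', 'B', 'A'],
    'outcome': ['in', 'in', 'out', 'in', 'out'],
})

# Plain outer merge: each duplicate on the left pairs with each duplicate
# on the right, so the 3 x 2 rows for index '001' alone become 6 rows.
plain = df1.merge(df2, on=keys, how='outer')


def merge_with_counter(left, right, keys):
    """Outer-merge so that repeated key rows pair up one-to-one.

    Hypothetical helper: '_n' numbers repeated key combinations
    0, 1, 2, ... within each frame and is dropped after the merge.
    """
    left = left.assign(_n=left.groupby(keys).cumcount())
    right = right.assign(_n=right.groupby(keys).cumcount())
    merged = left.merge(right, on=keys + ['_n'], how='outer')
    return merged.drop(columns='_n').reset_index(drop=True)


result = merge_with_counter(df1, df2, keys)
print(len(plain), len(result))  # 12 9
```

Using `assign` instead of writing the counter into `df1` and `df2` leaves the input frames unchanged, which may be convenient if they are reused later.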

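Add a per-index counter to both frames with groupby('index').cumcount(), then outer-merge on that counter together with the other columns. Because the counter is computed per index only, this relies on matching rows occupying the same position within each index in both frames:
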
df1['index_id'] = df1.groupby('index').cumcount()
df2['index_id'] = df2.groupby('index').cumcount()

merged = (
    df2
    .merge(df1, how='outer', on=['index', 'type', 'class', 'index_id'])
    .sort_values(by=['index', 'class'])
    .reset_index(drop=True)
    .drop(columns='index_id')
)

print(merged)
    index   type  class outcome
0   001     red    A    in
1   001     red    A    in
2   001     red    A    NaN
3   002     yellow A    out
4   002     red    A    NaN
5   003     green  A    NaN
6   003     green  B    in
7   004     blue   A    out
8   004     blue   A    NaN
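
The choice of grouping columns for the counter can change the result. The sketch below uses a hypothetical variant of df1 in which the two 002 rows are swapped (this reordering is an assumption for illustration, not data from the question) and compares a counter per index with a counter per full key. The helper `outer_merge` and the column `_n` are likewise illustrative names.

```python
import pandas as pd

keys = ['index', 'type', 'class']

# Hypothetical variant of df1: the '002' rows appear as red, then yellow.
df1_swapped = pd.DataFrame({
    'index': ['001', '001', '001', '002', '002', '003', '004', '004'],
    'type': ['red', 'red', 'red', 'red', 'yellow', 'green', 'blue', 'blue'],
    'class': ['A'] * 8,
})
df2 = pd.DataFrame({
    'index': ['001', '001', '002', '003', '004'],
    'type': ['red', 'red', 'yellow', 'green', 'blue'],
    'class': ['A', 'A', 'A', 'B', 'A'],
    'outcome': ['in', 'in', 'out', 'in', 'out'],
})


def outer_merge(left, right, counter_cols):
    """Outer-merge on the full key plus a counter built from counter_cols."""
    left = left.assign(_n=left.groupby(counter_cols).cumcount())
    right = right.assign(_n=right.groupby(counter_cols).cumcount())
    merged = left.merge(right, on=keys + ['_n'], how='outer')
    return merged.drop(columns='_n')


# Counter per index: '002 yellow' is numbered 1 on the left but 0 on the
# right, so the two rows no longer match and both survive the outer join.
by_index = outer_merge(df1_swapped, df2, ['index'])

# Counter per (index, type, class): the numbering ignores row order
# across different types, so '002 yellow' still matches.
by_all_keys = outer_merge(df1_swapped, df2, keys)

print(len(by_index), len(by_all_keys))  # 10 9
```

On the question's original ordering both variants give 9 rows, so the difference only shows up when the row order within an index differs between the two frames.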
