How to add column value based on condition in another dataframe?

Question

I have a dataframe with the main fixed location data:

id    name 
1      BEL
2      BEL
3      BEL
4      NYC
5      NYC
6      NYC
7      BER
8      BER

I also have a second dataframe where I get values for each id and city, like this (note that this dataframe is longer than the main dataframe):

id    name  value
1      BEL   9
2      BEL   7
3      BEL   3
4      NYC   76
5      NYC   76
6      NYC   23
7      BER   76
8      BER   2 
3      BEL   7
4      NYC   5
5      NYC   4
6      NYC   2

My goal is to check whether the values in the second dataframe are greater than 10. If a value is greater than 10, I want to add a column ['not_ok'] to the first dataframe, with 1 meaning not ok. How can I do this?

I can flag the second dataframe with dff['not_ok'] = np.where(dff['value'] > 10, '1', '0'), but since dff is much longer than the main dataframe, I don't know how to get that information into the first dataframe.

My goal looks something like this:

id    name  is_ok
1      BEL   1
2      BEL   1
3      BEL   1
4      NYC   0
5      NYC   0
6      NYC   0
7      BER   0
8      BER   1

Kindly add your expected output dataframe

– sammywemmy, Commented Sep 5, 2022 at 10:25

ouroboros1 · Accepted Answer · 2022-09-05 12:41:22Z

To reach the desired output you could try as follows:

import pandas as pd

data = {'id': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7: 8}, 
        'name': {0: 'BEL', 1: 'BEL', 2: 'BEL', 3: 'NYC', 4: 'NYC', 
                 5: 'NYC', 6: 'BER', 7: 'BER'}
        }
df = pd.DataFrame(data)

data2 = {'id': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 
                7: 8, 8: 3, 9: 4, 10: 5, 11: 6}, 
         'name': {0: 'BEL', 1: 'BEL', 2: 'BEL', 3: 'NYC', 4: 'NYC', 
                  5: 'NYC', 6: 'BER', 7: 'BER', 8: 'BEL', 9: 'NYC', 
                  10: 'NYC', 11: 'NYC'}, 
         'value': {0: 9, 1: 7, 2: 3, 3: 76, 4: 76, 5: 23, 6: 76, 
                   7: 2, 8: 7, 9: 5, 10: 4, 11: 2}
         }
df2 = pd.DataFrame(data2)

df = df.merge(df2[df2['value'].gt(10)], on=['id', 'name'], how='left')\
    .rename(columns={'value':'is_ok'})
df['is_ok'] = df['is_ok'].isna().astype(int)

print(df)

   id name  is_ok
0   1  BEL      1
1   2  BEL      1
2   3  BEL      1
3   4  NYC      0
4   5  NYC      0
5   6  NYC      0
6   7  BER      0
7   8  BER      1

Explanation:

Use Series.gt to get a boolean pd.Series, which we use to select from df2 only the rows that meet the condition value > 10.
Use df.merge to merge this slice from df2 with df and rename column value to is_ok (df.rename).
We now have a column with NaN values where there is no match on id, name, and values > 10 where there is. Use Series.isna to turn this column into booleans.
Finally, we can chain .astype(int) to change True | False into 1 | 0.
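
The merge above keeps df at eight rows because, in this sample, each id has at most one row above 10 in df2. If an id could have several such rows, a left merge would repeat it. A minimal sketch of one way to guard against that, assuming it is acceptable to collapse df2 to a single flag per (id, name) first; the names flags and result, and the choice to treat ids missing from df2 as ok, are assumptions for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6, 7, 8],
    "name": ["BEL", "BEL", "BEL", "NYC", "NYC", "NYC", "BER", "BER"],
})
df2 = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6, 7, 8, 3, 4, 5, 6],
    "name": ["BEL", "BEL", "BEL", "NYC", "NYC", "NYC", "BER", "BER",
             "BEL", "NYC", "NYC", "NYC"],
    "value": [9, 7, 3, 76, 76, 23, 76, 2, 7, 5, 4, 2],
})

# One row per (id, name): is_ok is 1 only if the largest value is <= 10.
flags = (
    df2.groupby(["id", "name"])["value"]
    .max()
    .le(10)
    .astype(int)
    .rename("is_ok")
    .reset_index()
)

# validate raises MergeError if either side has duplicate keys,
# so the result cannot silently grow beyond len(df).
result = df.merge(flags, on=["id", "name"], how="left", validate="one_to_one")

# Assumption: an id with no rows in df2 counts as ok.
result["is_ok"] = result["is_ok"].fillna(1).astype(int)
print(result)
```

Because the aggregation runs once over df2 and yields at most one row per key, the merge should stay small even when df2 is much longer than df.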

solid · Accepted Answer · 2022-09-05 10:42:17Z

Suppose your first (shorter) dataframe is called 'df_v1' and the second (longer) one is called 'df_v2'.

On 'df_v2' prepare the column like this:

df_v2["not_ok"] = df_v2["value"].apply(lambda x: x > 10)

Then, do a join on 'id' & 'name' like this:

df_v1.merge(df_v2[["id", "name", "not_ok"]], on=["id", "name"], how="left")
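
Since df_v2 can contain several rows per id, this left join can return more rows than df_v1 has. A possible alternative sketch that never changes the row count of df_v1, using a membership test instead of a join (the sample frames and the helper names bad_keys and v1_keys are illustrative, not part of the original answer):

```python
import pandas as pd

df_v1 = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6, 7, 8],
    "name": ["BEL", "BEL", "BEL", "NYC", "NYC", "NYC", "BER", "BER"],
})
df_v2 = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6, 7, 8, 3, 4, 5, 6],
    "name": ["BEL", "BEL", "BEL", "NYC", "NYC", "NYC", "BER", "BER",
             "BEL", "NYC", "NYC", "NYC"],
    "value": [9, 7, 3, 76, 76, 23, 76, 2, 7, 5, 4, 2],
})

# Unique (id, name) pairs that have at least one value above 10.
bad_keys = pd.MultiIndex.from_frame(
    df_v2.loc[df_v2["value"].gt(10), ["id", "name"]].drop_duplicates()
)

# Test each (id, name) pair of df_v1 for membership in bad_keys.
v1_keys = pd.MultiIndex.from_frame(df_v1[["id", "name"]])
df_v1["not_ok"] = v1_keys.isin(bad_keys).astype(int)
print(df_v1)
```

The column is assigned in place, so df_v1 keeps exactly its original rows regardless of how many duplicates df_v2 holds.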

2 Comments

ouroboros1 Over a year ago

Surely, just df_v2['value'].gt(10) is way faster than df_v2["value"].apply(lambda x: x > 10). Chain .astype(int) if you want True | False as 1 | 0.

Gobrel Over a year ago

The problem is that df_v1 has 800 rows and df_v2 has 5 million. When I use your solution, df_v1 also gets bigger. I want to keep df_v1 as it is, since its purpose is to show the data quality. In essence, I want each unique id and name that has values > 10 to be marked in df_v1. It is tricky because there are many duplicate ids, some with values under 10 and some with values over 10.

ThePyGuy · Accepted Answer · 2022-09-05 10:41:30Z

You can use Series.replace and pass the dictionary, and assign the result to Cat column:

>>> df['Cat'] = df.id.replace(cats)
#output:

    id Full Cat
0  123  Yes   A
1  456   No   B
2  789  Yes   C
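
The snippet above does not show how df or cats are defined. A self-contained sketch, assuming cats is a plain dict from id to category label (the mapping below is hypothetical and chosen only to reproduce the output shown):

```python
import pandas as pd

df = pd.DataFrame({"id": [123, 456, 789], "Full": ["Yes", "No", "Yes"]})

# Hypothetical mapping; the original answer does not show its contents.
cats = {123: "A", 456: "B", 789: "C"}

# Series.replace swaps every id found in the dict for its mapped label.
df["Cat"] = df["id"].replace(cats)
print(df)
```

With Series.replace, ids that are missing from the dict are left unchanged, whereas Series.map would turn them into NaN.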

Kimb0t · Accepted Answer · 2022-09-05 12:40:57Z

You can use the .lt(10) method to flag the values less than 10 (values < 10 are labeled 1, all other values 0). Then group by id and take min(), so that an id with duplicate rows in the second DataFrame keeps the minimum flag (0 here, if any of its values is 10 or more). Here is the code:

import pandas as pd

df1 = pd.DataFrame({'id': [1, 2, 3, 4, 5, 6, 7, 8], 
                    'name': ['BEL', 'BEL', 'BEL', 'NYC', 'NYC', 'NYC', 'BER', 'BER']})
df2 = pd.DataFrame({'id': [1, 2, 3, 4, 5, 6, 7, 8, 3, 4, 5, 6],
                    'name': ['BEL', 'BEL', 'BEL', 'NYC', 'NYC', 'NYC', 'BER', 'BER', 'BEL', 'NYC', 'NYC', 'NYC'],
                   'value': [9, 7, 3, 76, 76, 23, 76, 2, 7, 5, 4, 2]})

df2['is_ok'] = df2['value'].lt(10).astype(int)
df3 = df2[['id', 'name', 'is_ok']].groupby('id').min().reset_index()

print(df3)
# If you want to merge it with the first DataFrame
# df1 = df1.merge(df3[["id", "is_ok"]], on=["id"])
# print(df1)

Output:

   id name  is_ok
0   1  BEL      1
1   2  BEL      1
2   3  BEL      1
3   4  NYC      0
4   5  NYC      0
5   6  NYC      0
6   7  BER      0
7   8  BER      1
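
The question treats values greater than 10 as not ok, so a value of exactly 10 would count as ok, while .lt(10) flags it as not ok. The sample data contains no 10, so the output above is unaffected. A small sketch of the difference (the sample values are made up for illustration):

```python
import pandas as pd

values = pd.Series([9, 10, 11])

# .lt(10) is strictly "less than": 10 itself is flagged 0.
strict = values.lt(10).astype(int).tolist()

# .le(10) is "less than or equal": 10 itself is flagged 1,
# which matches "greater than 10 is not ok".
inclusive = values.le(10).astype(int).tolist()

print(strict)     # [1, 0, 0]
print(inclusive)  # [1, 1, 0]
```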
