I have a dataframe with the main fixed location data:

id    name 
1      BEL
2      BEL
3      BEL
4      NYC
5      NYC
6      NYC
7      BER
8      BER      

I also have a second dataframe where I get values for each id and city, like this (note that this dataframe is longer than the main dataframe):

id    name  value
1      BEL   9
2      BEL   7
3      BEL   3
4      NYC   76
5      NYC   76
6      NYC   23
7      BER   76
8      BER   2 
3      BEL   7
4      NYC   5
5      NYC   4
6      NYC   2

My goal: I want to check whether the values in the second dataframe are greater than 10. If a value is greater than 10, I want to add a column to the first dataframe, e.g. ['not_ok'], with 1 meaning not ok. How can I do this?

I can flag the rows of the second dataframe with dff['not_ok'] = np.where(dff['value'] > 10, '1', '0'), but since dff is much longer, I don't know how to carry that information over to the first dataframe.

My goal looks something like this:

id    name  is_ok
1      BEL   1
2      BEL   1
3      BEL   1
4      NYC   0
5      NYC   0
6      NYC   0
7      BER   0
8      BER   1  
1
  • 1
    Kindly add your expected output dataframe Commented Sep 5, 2022 at 10:25

4 Answers 4

1

To reach the desired output you could try as follows:

import pandas as pd

data = {'id': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7: 8}, 
        'name': {0: 'BEL', 1: 'BEL', 2: 'BEL', 3: 'NYC', 4: 'NYC', 
                 5: 'NYC', 6: 'BER', 7: 'BER'}
        }
df = pd.DataFrame(data)

data2 = {'id': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 
                7: 8, 8: 3, 9: 4, 10: 5, 11: 6}, 
         'name': {0: 'BEL', 1: 'BEL', 2: 'BEL', 3: 'NYC', 4: 'NYC', 
                  5: 'NYC', 6: 'BER', 7: 'BER', 8: 'BEL', 9: 'NYC', 
                  10: 'NYC', 11: 'NYC'}, 
         'value': {0: 9, 1: 7, 2: 3, 3: 76, 4: 76, 5: 23, 6: 76, 
                   7: 2, 8: 7, 9: 5, 10: 4, 11: 2}
         }
df2 = pd.DataFrame(data2)

df = df.merge(df2[df2['value'].gt(10)], on=['id', 'name'], how='left')\
    .rename(columns={'value':'is_ok'})
df['is_ok'] = df['is_ok'].isna().astype(int)

print(df)

   id name  is_ok
0   1  BEL      1
1   2  BEL      1
2   3  BEL      1
3   4  NYC      0
4   5  NYC      0
5   6  NYC      0
6   7  BER      0
7   8  BER      1

Explanation:

  • Use Series.gt to get a boolean pd.Series, which we use to select from df2 only the rows that meet the condition value > 10.
  • Use df.merge to left-merge this slice of df2 with df, and rename column value to is_ok (df.rename).
  • The new column now holds NaN where df2 has no row with value > 10 for that id and name, and the matching value (> 10) where it does. Use Series.isna to turn this column into booleans.
  • Finally, we can chain .astype(int) to change True | False into 1 | 0.
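
One caveat with the merge above: if df2 holds several rows with value > 10 for the same id and name, a left merge repeats that row of df once per match. The sample data happens to avoid this. Below is a minimal sketch of a variant that keeps df at its original length, assuming a pair should be flagged as soon as any of its values exceeds 10. The extra (4, NYC, 50) row is hypothetical and only added to trigger the duplicate case; the names flagged and out are likewise illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6, 7, 8],
    "name": ["BEL", "BEL", "BEL", "NYC", "NYC", "NYC", "BER", "BER"],
})
df2 = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6, 7, 8, 3, 4, 5, 6, 4],
    "name": ["BEL", "BEL", "BEL", "NYC", "NYC", "NYC", "BER", "BER",
             "BEL", "NYC", "NYC", "NYC", "NYC"],
    # The last value (50) is a hypothetical second "> 10" row for id 4.
    "value": [9, 7, 3, 76, 76, 23, 76, 2, 7, 5, 4, 2, 50],
})

# Keep each offending (id, name) pair exactly once.
flagged = df2.loc[df2["value"].gt(10), ["id", "name"]].drop_duplicates()

# indicator=True adds a "_merge" column: "both" if the pair was flagged,
# "left_only" if it was not.
out = df.merge(flagged, on=["id", "name"], how="left", indicator=True)
out["is_ok"] = (out["_merge"] == "left_only").astype(int)
out = out.drop(columns="_merge")

print(out)
```

Because flagged has unique keys, out has the same number of rows as df regardless of how many rows df2 contains.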

1

Suppose your first (shorter) dataframe is called 'df_v1' and the second (longer) one is called 'df_v2'.

On 'df_v2' prepare the column like this:

df_v2["not_ok"] = df_v2["value"].apply(lambda x: x > 10)

Then, do a join on 'id' & 'name' like this:

df_v1.merge(df_v2[["id", "name", "not_ok"]], on=["id", "name"], how="left")
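
Note that merging df_v2 directly repeats each row of df_v1 once per matching row in df_v2. Here is a sketch that first collapses df_v2 to one flag per (id, name) and only then joins, assuming a pair counts as not ok when any of its values exceeds 10. It reuses the question's sample data; the names flags and result are illustrative.

```python
import pandas as pd

df_v1 = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6, 7, 8],
    "name": ["BEL", "BEL", "BEL", "NYC", "NYC", "NYC", "BER", "BER"],
})
df_v2 = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6, 7, 8, 3, 4, 5, 6],
    "name": ["BEL", "BEL", "BEL", "NYC", "NYC", "NYC", "BER", "BER",
             "BEL", "NYC", "NYC", "NYC"],
    "value": [9, 7, 3, 76, 76, 23, 76, 2, 7, 5, 4, 2],
})

# One row per (id, name): 1 if the largest value for that pair exceeds 10.
flags = (
    df_v2.groupby(["id", "name"])["value"]
    .max()
    .gt(10)
    .astype(int)
    .rename("not_ok")
    .reset_index()
)

# validate="m:1" makes pandas raise if flags ever has duplicate keys,
# so the result cannot silently grow beyond len(df_v1).
result = df_v1.merge(flags, on=["id", "name"], how="left", validate="m:1")

# Assumption: pairs that never appear in df_v2 are treated as not flagged.
result["not_ok"] = result["not_ok"].fillna(0).astype(int)

print(result)
```

The aggregation runs once over the long frame, and the join then only touches as many rows as df_v1 has.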

2 Comments

Surely, just df_v2['value'].gt(10) is way faster than df_v2["value"].apply(lambda x: x > 10). Chain .astype(int) if you want True | False as 1 | 0.
The problem is that df_v1 has 800 rows and df_v2 has 5 million. When I use your solution, df_v1 also gets bigger. I want to keep df_v1 as it is, since its purpose is to show the data quality. In essence, I want each unique id and name that has a value > 10 to be marked in df_v1. It is tricky because there are many duplicate ids, some with values under 10 and some with values over 10.
0

You can use Series.replace and pass the dictionary, and assign the result to Cat column:

>>> df['Cat'] = df.id.replace(cats)
#output:

    id Full Cat
0  123  Yes   A
1  456   No   B
2  789  Yes   C


0

You can use the .le(10) method to flag values that are at most 10 (values <= 10 become 1 and values > 10 become 0, which matches the question's "greater than 10" condition). Then group by id and take min(), so that an id with duplicate rows in the second DataFrame keeps the minimum flag (0 here) whenever any of its values exceeds 10. Here is the code:

import pandas as pd

df1 = pd.DataFrame({'id': [1, 2, 3, 4, 5, 6, 7, 8], 
                    'name': ['BEL', 'BEL', 'BEL', 'NYC', 'NYC', 'NYC', 'BER', 'BER']})
df2 = pd.DataFrame({'id': [1, 2, 3, 4, 5, 6, 7, 8, 3, 4, 5, 6],
                    'name': ['BEL', 'BEL', 'BEL', 'NYC', 'NYC', 'NYC', 'BER', 'BER', 'BEL', 'NYC', 'NYC', 'NYC'],
                    'value': [9, 7, 3, 76, 76, 23, 76, 2, 7, 5, 4, 2]})

# 1 if the value is at most 10, 0 if it is greater than 10
df2['is_ok'] = df2['value'].le(10).astype(int)
# one row per id; min() yields 0 as soon as any row for that id is 0
df3 = df2[['id', 'name', 'is_ok']].groupby('id').min().reset_index()

print(df3)
# If you want to merge it with the first DataFrame
# df1 = df1.merge(df3[["id", "is_ok"]], on=["id"])
# print(df1)

Output:

    id name  is_ok
0   1  BEL      1
1   2  BEL      1
2   3  BEL      1
3   4  NYC      0
4   5  NYC      0
5   6  NYC      0
6   7  BER      0
7   8  BER      1
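
If the goal is only to annotate the first DataFrame, the groupby and merge can be skipped altogether. Here is a sketch using Series.isin, assuming (as the groupby('id') above does) that id alone identifies a location, and applying the question's condition value > 10 directly. The name bad_ids is illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6, 7, 8],
    "name": ["BEL", "BEL", "BEL", "NYC", "NYC", "NYC", "BER", "BER"],
})
df2 = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6, 7, 8, 3, 4, 5, 6],
    "name": ["BEL", "BEL", "BEL", "NYC", "NYC", "NYC", "BER", "BER",
             "BEL", "NYC", "NYC", "NYC"],
    "value": [9, 7, 3, 76, 76, 23, 76, 2, 7, 5, 4, 2],
})

# ids that have at least one value above 10 anywhere in df2
bad_ids = df2.loc[df2["value"] > 10, "id"].unique()

# is_ok is 1 unless the id shows up among the offending ids
df1["is_ok"] = (~df1["id"].isin(bad_ids)).astype(int)

print(df1)
```

This assigns a column in place, so df1 keeps its original rows and order. An id that never appears in df2 ends up with is_ok = 1 under this sketch.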
