Pandas Dataframe: Creating column based on multiple conditions

Question

Edited

I'm sorry I didn't post it right the first time. The solutions suggested worked if there were only two entries of each Lead ID with Lead Status "A". I am changing my data. I again apologize.

Data:

Lead ID     Lead Status      Duration     Target
1-1H9C0XL   Too Small       -0.466177     1
1-1H9G33C   A               -0.620709     0 
1-1H9G33C   A               -0.500709     0
1-1H9G33C   A                0.337401     0
4-1HFORF8   No Fit          -0.343840     1
4-1HFSXOG   No Fit          -0.124920     1
4-1HLQ2IJ   A               -0.330962     0 
4-1HLQ2IJ   A                0.130818     0
4-1HLQ2IJ   A               -0.400817     0
4-1HLQ2IJ   A                0.240818     0

I want to accomplish following:

If there is a duplicate in the Lead ID and Lead Status, make all the Target values "1" for that LeadID with shorter Duration.

Desired Output

Lead ID     Lead Status      Duration     Target
1-1H9C0XL   Too Small       -0.466177     1
1-1H9G33C   A               -0.620709     1 
1-1H9G33C   A               -0.500709     1
1-1H9G33C   A                0.337401     0
4-1HFORF8   No Fit          -0.343840     1
4-1HFSXOG   No Fit          -0.124920     1
4-1HLQ2IJ   A               -0.330962     1 
4-1HLQ2IJ   A                0.130818     1
4-1HLQ2IJ   A               -0.400817     1
4-1HLQ2IJ   A                0.240818     0

I am not able to implement a condition of checking for duplicates and the value in duration to update the last column. I appreciate any assistance a lot.

You want all but the largest of the group to be 1? And the largest to be 0? — Ted Petrou
– Ted Petrou, Commented Sep 12, 2017 at 21:07

BENY · Accepted Answer · 2017-09-12 21:16:51Z

1

Try this(assuming your df is sorted)

df.loc[df[df.duplicated(['LeadID','LeadStatus'],keep=False)].drop_duplicates(['LeadID','LeadStatus'],keep='first').index,'Target']=1
df
Out[895]: 
      LeadID LeadStatus  Duration  Target
0  1-1H9C0XL   TooSmall    -0.466       1
1  1-1H9G33C          A    -0.621       1
2  1-1H9G33C          A     0.337       0
3  4-1HFORF8      NoFit    -0.344       1
4  4-1HFSXOG      NoFit    -0.125       1
5  4-1HLQ2IJ          A    -0.331       1
6  4-1HLQ2IJ          A     0.241       0

Update

df=df.sort_values(['LeadID','LeadStatus','Duration'])

df.loc[df[df.duplicated(['LeadID','LeadStatus'],keep='last')].index,'Target']=1

Out[911]: 
      LeadID LeadStatus  Duration  Target
0  1-1H9C0XL   TooSmall    -0.466       1
1  1-1H9G33C          A    -0.621       1
2  1-1H9G33C          A    -0.501       1
3  1-1H9G33C          A     0.337       0
4  4-1HFORF8      NoFit    -0.344       1
5  4-1HFSXOG      NoFit    -0.125       1
8  4-1HLQ2IJ          A    -0.401       1
6  4-1HLQ2IJ          A    -0.331       1
7  4-1HLQ2IJ          A     0.131       1
9  4-1HLQ2IJ          A     0.241       0

edited Sep 12, 2017 at 21:16

answered Sep 12, 2017 at 20:40

BENY

324k22 gold badges176 silver badges250 bronze badges

Sign up to request clarification or add additional context in comments.

4 Comments

Ted Petrou Over a year ago

@KrishnangKDalal This only works if you assume that the unique rows have 1 already.

Ted Petrou Over a year ago

There's also a much simpler way to do this instead of .loc. Just add the inner dropduplicated code to target. See my answer

Ted Petrou Over a year ago

Theres no need to sort by all those columns either. The automatic alignment of indexes will take care of the proper place for the data.

Krishnang K Dalal Over a year ago

@TedPetrou: yes, I am aware and I have created the target column with just 1 and then changing them 0 according to the solution

Ted Petrou · Accepted Answer · 2017-09-12 21:32:48Z

Here is an idiomatic and performant answer.

df['Target'] += df.sort_values('Duration')\
                  .duplicated(subset=['Lead ID', 'Lead Status'], keep='last')

If you don't assume unique rows have a 1 then you can do the following.

df1 = df.sort_values('Duration')
unique = ~df1.duplicated(subset=['Lead ID', 'Lead Status'], keep=False) * 1
first = df1.duplicated(subset=['Lead ID', 'Lead Status'], keep='last') * 1
df['Target'] = unique + first

And a less performant way:

df.groupby(['Lead ID', 'Lead Status'])['Duration']\
  .transform(lambda x: 1 if len(x) == 1 else x < x.max())

     Lead ID Lead Status  Duration  Target
0  1-1H9C0XL   Too Small -0.466177       1
1  1-1H9G33C           A -0.620709       1
2  1-1H9G33C           A -0.500709       0
3  1-1H9G33C           A  0.337401       1
4  4-1HFORF8      No Fit -0.343840       1
5  4-1HFSXOG      No Fit -0.124920       1
6  4-1HLQ2IJ           A -0.330962       1
7  4-1HLQ2IJ           A  0.130818       1
8  4-1HLQ2IJ           A -0.400817       1
9  4-1HLQ2IJ           A  0.240818       0

Collectives™ on Stack Overflow

Pandas Dataframe: Creating column based on multiple conditions

2 Answers 2

4 Comments

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

4 Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Related