Remove rows from pandas dataframe with condition

Question

I have a dataframe that looks like this:

import pandas as pd

### create toy data set
data = [[1111,'10/1/2021',21,123],
        [1111,'10/1/2021',-21,123],
        [1111,'10/1/2021',21,123],
        [2222,'10/2/2021',15,234],
        [2222,'10/2/2021',15,234],
        [3333,'10/3/2021',15,234],
        [3333,'10/3/2021',15,234]]

df = pd.DataFrame(data,columns = ['Individual','date','number','cc'])

What I want to do is remove rows where Individual, date, and cc are the same, but number is a negative value in one case and a positive in the other case. For example, in the first three rows, I would remove rows 1 and 2 (because 21 and -21 values are equal in absolute terms), but I don't want to remove row 3 (because I have already accounted for the negative value in row 2 by eliminating row 1). Also, I don't want to remove duplicated values if the corresponding number values are positive. I have tried a variety of duplicated() approaches, but just can't get it right.

Expected results would be:

   Individual       date  number   cc
0        1111  10/1/2021      21  123
1        2222  10/2/2021      15  234
2        2222  10/2/2021      15  234
3        3333  10/3/2021      15  234
4        3333  10/3/2021      15  234

Thus, the first two rows are removed, but not the third row, since the negative value is already accounted for.

Any assistance would be appreciated. I am trying to do this without a loop, but it may be unavoidable. It seems similar to this question, but I can't figure out how to make it work in my case.

Will the positive and negative values always be equal and zero out as in your example? And will other good rows ever be zero?
– G. Anderson, Commented Oct 21, 2021 at 16:37
Yes, I have edited the question to say that the positive and negative values that count as duplicates (and are thus removed) sum to zero.
– user44796, Commented Oct 21, 2021 at 16:39

sophocles · Accepted Answer · 2021-10-21 18:28:23Z

I can't be sure since you did not post your expected output, but you could try the approach below. Create a separate DataFrame called n that contains the rows with a negative 'number' (with the 'number' column dropped), and left-join it to the original with indicator=True.

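# rows with a negative 'number' (strictly below zero), keeping only the key columns
n = df.loc[df.number.lt(0)].drop('number', axis=1)
# left merge on the shared columns (Individual, date, cc); _merge records whether a match was found
df = pd.merge(df, n, how='left', indicator=True)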

>>> df

   Individual       date  number   cc     _merge
0        1111  10/1/2021      21  123       both
1        1111  10/1/2021     -21  123       both
2        1111  10/1/2021      21  123       both
3        2222  10/2/2021      15  234  left_only
4        2222  10/2/2021      15  234  left_only
5        3333  10/3/2021      15  234  left_only
6        3333  10/3/2021      15  234  left_only

This will allow us to identify the Individual/date/cc groups that contain a row with a negative 'number'.
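As a side note, the same per-group flag can be computed without a merge, which avoids duplicated rows when a group holds more than one negative row (each extra row in n produces an extra match). A minimal sketch on a shortened copy of the toy data; the names keys, neg and has_negative are introduced here for illustration and are not part of the answer:

```python
import pandas as pd

# Shortened copy of the question's toy data.
df = pd.DataFrame(
    [[1111, '10/1/2021', 21, 123],
     [1111, '10/1/2021', -21, 123],
     [1111, '10/1/2021', 21, 123],
     [2222, '10/2/2021', 15, 234]],
    columns=['Individual', 'date', 'number', 'cc'])

keys = ['Individual', 'date', 'cc']

# True for every row whose Individual/date/cc group contains at least one negative 'number'.
has_negative = (df.assign(neg=df['number'].lt(0))
                  .groupby(keys)['neg']
                  .transform('any'))
```

For the toy data, has_negative is True on exactly the rows that the merge marks as 'both'.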

Then you can locate the rows with 'both' in _merge, and only use those to perform a groupby.head(2), concatenating that with the rest of the df:

out = pd.concat([df.loc[df._merge.eq('both')].groupby(['Individual','date','cc']).head(2),
           df.loc[df._merge.ne('both')]]).drop('_merge',axis=1)

Which prints:

   Individual       date  number   cc
0        1111  10/1/2021      21  123
1        1111  10/1/2021     -21  123
3        2222  10/2/2021      15  234
4        2222  10/2/2021      15  234
5        3333  10/3/2021      15  234
6        3333  10/3/2021      15  234


2 Comments

user44796 Over a year ago

I added the expected output. What you have does not produce it, but I see what you are doing, and maybe it can be edited to get the expected results.

sophocles Over a year ago

If you change head(2) to head(1), you get your desired output. However, is that always going to be the case, i.e. keeping only the first row of each group that has a negative number? You will probably need to use sort_values together with head to make sure it works. The data you provided is not enough to account for all the different scenarios, which means that head(1) gives your desired outcome here but might not work on your real data set. So please consider providing more information and more concrete examples.
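For the general case raised in the comment above, where head(1) depends on row order, one loop-free option is to cancel each negative row against exactly one positive row of the same absolute value within the same Individual/date/cc group. A minimal sketch, assuming that pairing rule is what is wanted; the helper names keys, abs_number, is_neg, k and pair_size are introduced here for illustration and do not come from the thread:

```python
import pandas as pd

# Toy data from the question.
data = [[1111, '10/1/2021', 21, 123],
        [1111, '10/1/2021', -21, 123],
        [1111, '10/1/2021', 21, 123],
        [2222, '10/2/2021', 15, 234],
        [2222, '10/2/2021', 15, 234],
        [3333, '10/3/2021', 15, 234],
        [3333, '10/3/2021', 15, 234]]
df = pd.DataFrame(data, columns=['Individual', 'date', 'number', 'cc'])

keys = ['Individual', 'date', 'cc']

# Helper columns: absolute value and sign of 'number'.
tmp = df.assign(abs_number=df['number'].abs(),
                is_neg=df['number'].lt(0))

# Number the rows inside each (keys, absolute value, sign) bucket: 0, 1, 2, ...
# The k-th positive and the k-th negative of the same magnitude then share the same k.
tmp['k'] = tmp.groupby(keys + ['abs_number', 'is_neg']).cumcount()

# Each (keys, absolute value, k) group holds at most one positive and one negative row,
# so a group of size 2 is a positive/negative pair that cancels out.
pair_size = tmp.groupby(keys + ['abs_number', 'k'])['is_neg'].transform('size')

# Keep only the rows that were not cancelled.
out = df.loc[pair_size.ne(2)].reset_index(drop=True)
```

On the toy data this drops the first 21 and the -21 and keeps the second 21, so out has the five rows of the expected result. Duplicated positive rows are left alone because they never find a negative partner.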
