pandas dataframe filtering multiple columns and rows

Question

Given a dataframe with the following format:

TEST_ID | ATOMIC_NUMBER | COMPOSITION_PERCENT | POSITION
1       | 28            | 49.84               | 0
1       | 22            | 50.01               | 0
1       | 47            | 0.06                | 1
2       | 22            | 49.84               | 0
2       | 47            | 50.01               | 1
3       | 28            | 49.84               | 0
3       | 22            | 50.01               | 0
3       | 47            | 0.06                | 0

I want to select only the tests that have ATOMIC_NUMBER of 22 AND 28 in POSITION 0, no more no less. So I'd like a filter that returns:

TEST_ID | ATOMIC_NUMBER | COMPOSITION_PERCENT | POSITION
1       | 28            | 49.84               | 0
1       | 22            | 50.01               | 0
1       | 47            | 0.06                | 1

EDIT: I'm trying to convert this logic from SQL into python. Here's the SQL code:

select * from compositions 
where compositions.test_id in (

  select a.test_id from (

    select test_id from compositions
    where test_id in (
      select test_id from (
        select * from COMPOSITIONS where position == 0 )
      group by test_id
      having count(test_id) = 2 )
    and atomic_number = 22) a

  join (

    select test_id from compositions
    where test_id in (
      select test_id from (
        select * from COMPOSITIONS where position == 0 )
      group by test_id
      having count(test_id) = 2 )
    and atomic_number = 28) b

  on a.test_id = b.test_id )

Hi @Vaishali, I think this question is more complicate than the marked one ... — BENY
– BENY, Commented Oct 3, 2018 at 20:21
So the reason 47 is selected is because it's in the same group as the valid result? — user3483203
– user3483203, Commented Oct 3, 2018 at 20:37
@user3483203 because it has an extra element in position 0 (47). — Brandon
– Brandon, Commented Oct 3, 2018 at 20:49

Vaishali · Accepted Answer · 2018-10-03 20:53:01Z

1

You can create a boolean series to capture test_ids and then index the df using the same.

s = df[df['POSITION'] == 0].groupby('TEST_ID').apply(lambda x: ((x['ATOMIC_NUMBER'].count() == 2 ) & (sorted(x['ATOMIC_NUMBER'].values.tolist()) == [22,28])).all())

test_id = s[s].index.tolist()

df[df['TEST_ID'].isin(test_id)]

    TEST_ID ATOMIC_NUMBER   COMPOSITION_PERCENT POSITION
0   1       28              49.84               0
1   1       22              50.01               0
2   1       47              0.06                1

edited Oct 3, 2018 at 20:53

answered Oct 3, 2018 at 20:45

Vaishali

38.5k5 gold badges62 silver badges88 bronze badges

Sign up to request clarification or add additional context in comments.

3 Comments

Brandon Over a year ago

Thanks, I'm going to try this out.

Vaishali Over a year ago

@Brandon, sure. This solution checks for exactly two values in atomic_number - one 22 and another 28 for position 0. If the cond is met, it returns the test_ids which are used to filter the dataframe.

Vaishali Over a year ago

@Brandon, great that its worked and thank you for accepting:)

Collectives™ on Stack Overflow

pandas dataframe filtering multiple columns and rows

1 Answer 1

3 Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

3 Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related