1

Given a dataframe with the following format:

TEST_ID | ATOMIC_NUMBER | COMPOSITION_PERCENT | POSITION
1       | 28            | 49.84               | 0
1       | 22            | 50.01               | 0
1       | 47            | 0.06                | 1
2       | 22            | 49.84               | 0
2       | 47            | 50.01               | 1
3       | 28            | 49.84               | 0
3       | 22            | 50.01               | 0
3       | 47            | 0.06                | 0

I want to select only the tests that have ATOMIC_NUMBER of 22 AND 28 in POSITION 0, no more no less. So I'd like a filter that returns:

TEST_ID | ATOMIC_NUMBER | COMPOSITION_PERCENT | POSITION
1       | 28            | 49.84               | 0
1       | 22            | 50.01               | 0
1       | 47            | 0.06                | 1

EDIT: I'm trying to convert this logic from SQL into python. Here's the SQL code:

select * from compositions 
where compositions.test_id in (

  select a.test_id from (

    select test_id from compositions
    where test_id in (
      select test_id from (
        select * from COMPOSITIONS where position == 0 )
      group by test_id
      having count(test_id) = 2 )
    and atomic_number = 22) a

  join (

    select test_id from compositions
    where test_id in (
      select test_id from (
        select * from COMPOSITIONS where position == 0 )
      group by test_id
      having count(test_id) = 2 )
    and atomic_number = 28) b

  on a.test_id = b.test_id )
6
  • 1
    What is 47 doing then? Commented Oct 3, 2018 at 20:16
  • Can you post some code for what you've tried so far? Commented Oct 3, 2018 at 20:20
  • Hi @Vaishali, I think this question is more complicate than the marked one ... Commented Oct 3, 2018 at 20:21
  • So the reason 47 is selected is because it's in the same group as the valid result? Commented Oct 3, 2018 at 20:37
  • 1
    @user3483203 because it has an extra element in position 0 (47). Commented Oct 3, 2018 at 20:49

1 Answer 1

1

You can create a boolean series to capture test_ids and then index the df using the same.

s = df[df['POSITION'] == 0].groupby('TEST_ID').apply(lambda x: ((x['ATOMIC_NUMBER'].count() == 2 ) & (sorted(x['ATOMIC_NUMBER'].values.tolist()) == [22,28])).all())

test_id = s[s].index.tolist()

df[df['TEST_ID'].isin(test_id)]

    TEST_ID ATOMIC_NUMBER   COMPOSITION_PERCENT POSITION
0   1       28              49.84               0
1   1       22              50.01               0
2   1       47              0.06                1
Sign up to request clarification or add additional context in comments.

3 Comments

Thanks, I'm going to try this out.
@Brandon, sure. This solution checks for exactly two values in atomic_number - one 22 and another 28 for position 0. If the cond is met, it returns the test_ids which are used to filter the dataframe.
@Brandon, great that its worked and thank you for accepting:)

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.