
Apologies if the problem is trivial, but as a Python newbie I wasn't able to find the right solution.

I have two dataframes and I need to add a column to the first dataframe that is true if a certain value of the first dataframe is between two values of the second dataframe otherwise false.

for example:

first_df = pd.DataFrame({'code1':[1,1,2,2,3,1,1],'code2':[10,22,15,15,7,130,2]})
second_df = pd.DataFrame({'code1':[1,1,2,2,3,1,1],'code2_start':[5,20,11,11,5,110,220],'code2_end':[15,25,20,20,10,120,230]})

first_df

  code1 code2
0   1   10
1   1   22
2   2   15
3   2   15
4   3   7
5   1   130
6   1   2

second_df

  code1 code2_end   code2_start
0   1   15          5
1   1   25          20
2   2   20          11
3   2   20          11
4   3   10          5
5   1   120         110
6   1   230         220

For each row in the first dataframe I need to check whether the value in the code2 column falls within one of the ranges defined by the rows of second_df. For example:

in row 1 of first_df code1=1 and code2=22

Checking second_df, I have 4 rows with code1=1 (rows 0, 1, 5 and 6). The value code2=22 is in the interval given by code2_start=20 and code2_end=25, so the function should return True.

Considering an example where the function should return False,

in row 5 of first_df code1=1 and code2=130

but there is no interval containing 130 where code1=1

I have tried to use this function

def check(first_df,second_df):
    for i in range(len(first_df):
        return ((second_df.code2_start <= first_df.code2[i]) & (second_df.code2_end <= first_df.code2[i]) & (second_df.code1 == first_df.code1[i])).any()

and to vectorize it

first_df['output'] = np.vectorize(check)(first_df, second_df)

but obviously with no success.

I would be happy for any input you could provide.

thx.

A.

As a practical example:

first_df.code1[0] = 1

therefore I need to find in second_df all the instances where

second_df.code1 == first_df.code1[0]
0     True
1     True
2    False
3    False
4    False
5     True
6     True

for the instances 0,1,5,6 where the status is True I need to check if the value

first_df.code2[0]
10

is within one of the ranges identified by

second_df[second_df.code1 == first_df.code1[0]][['code2_start','code2_end']]
    code2_start code2_end
0   5           15
1   20          25
5   110         120
6   220         230

Since first_df.code2[0] is 10, it lies between 5 and 15, the range in row 0, so my function should return True. For first_df.code1[6] the value is still 1, so the table of ranges is the same as above, but first_df.code2[6] is 2 and no interval contains 2, so the result should be False.

1 Answer

first_df['output'] = (second_df.code2_start <= first_df.code2) & (second_df.code2_end >= first_df.code2)

This works because a comparison such as second_df.code2_start <= first_df.code2 is evaluated element by element and returns a boolean Series.

If you then combine two of these boolean Series with & (logical AND), you get a Series that is True where both Series were True and False otherwise.

Here's an example:

>>> import pandas as pd
>>> a = pd.DataFrame([{1:2,2:4,3:6},{1:3,2:6,3:9},{1:4,2:8,3:10}])
>>> a['output'] = (a[2] <= a[3]) & (a[2] >= a[1])
>>> a
   1  2   3 output
0  2  4   6   True
1  3  6   9   True
2  4  8  10   True
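For reference, here is a small sketch of the same row-wise comparison on the frames from the question. It assumes both frames have the same length and the same index, and it ignores code1 entirely: row i of first_df is compared only with row i of second_df. The name row_wise is just for illustration.

```python
import pandas as pd

first_df = pd.DataFrame({'code1': [1, 1, 2, 2, 3, 1, 1],
                         'code2': [10, 22, 15, 15, 7, 130, 2]})
second_df = pd.DataFrame({'code1': [1, 1, 2, 2, 3, 1, 1],
                          'code2_start': [5, 20, 11, 11, 5, 110, 220],
                          'code2_end': [15, 25, 20, 20, 10, 120, 230]})

# Element-wise check: start <= code2 <= end, compared row against row
row_wise = (second_df.code2_start <= first_df.code2) & (second_df.code2_end >= first_df.code2)
print(row_wise.tolist())
# [True, True, True, True, True, False, False]
```

Because the comparison is aligned on the index, it cannot check a value against several candidate ranges at once.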

EDIT:

So based on your updated question and my new interpretation of your problem, I would do something like this:

import pandas as pd

# Define some data to work with
df_1 = pd.DataFrame([{'c1':1,'c2':5},{'c1':1,'c2':10},{'c1':1,'c2':20},{'c1':2,'c2':8}])
df_2 = pd.DataFrame([{'c1':1,'start':3,'end':6},{'c1':1,'start':7,'end':15},{'c1':2,'start':5,'end':15}])

# Function checks if c2 value is within any range matching c1 value
def checkRange(x, code_range):
    idx = code_range.c1 == x.c1
    code_range = code_range.loc[idx]
    check = (code_range.start <= x.c2) & (code_range.end >= x.c2)
    return check.any()

# Apply the checkRange function to each row of the DataFrame
df_1['output'] = df_1.apply(lambda x: checkRange(x, df_2), axis=1)

What I do here is define a function called checkRange which takes as input x, a single row of df_1, and code_range, the entire df_2 DataFrame. It first finds the rows of code_range which have the same c1 value as the given row, x.c1. Then the non-matching rows are discarded. This is done in the first 2 lines:

idx = code_range.c1 == x.c1
code_range = code_range.loc[idx]

Next, we get a boolean Series which tells us if x.c2 falls within any of the ranges given in the reduced code_range DataFrame:

check = (code_range.start <= x.c2) & (code_range.end >= x.c2)

Finally, since we only care whether x.c2 falls within at least one of the ranges, we return the value of check.any(). When we call any() on a boolean Series, it returns True if any of the values in the Series are True.
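To make the intermediate values concrete, here is a minimal sketch of what check looks like for two c2 values against the c1 == 1 ranges of df_2. The names check_10 and check_20 are only for illustration.

```python
import pandas as pd

# Same ranges as df_2 in the answer
df_2 = pd.DataFrame([{'c1': 1, 'start': 3, 'end': 6},
                     {'c1': 1, 'start': 7, 'end': 15},
                     {'c1': 2, 'start': 5, 'end': 15}])

# Keep only the ranges that belong to c1 == 1
code_range = df_2.loc[df_2.c1 == 1]

# c2 = 10 lies inside the second range (7..15) only
check_10 = (code_range.start <= 10) & (code_range.end >= 10)
print(check_10.tolist(), check_10.any())   # [False, True] True

# c2 = 20 lies outside both ranges
check_20 = (code_range.start <= 20) & (code_range.end >= 20)
print(check_20.tolist(), check_20.any())   # [False, False] False
```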

To call the checkRange function on each row of df_1, we can use apply(). I define a lambda expression in order to send the checkRange function the row as well as df_2. axis=1 means that the function will be called on each row (instead of each column) for the DataFrame.
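If apply() turns out to be slow on a large DataFrame, one possible alternative is to pair rows through a merge instead of looping in Python. The sketch below is not part of the approach described above; the helper name in_any_range and the temporary row column are assumptions made for illustration, and it uses the column names from the question. Note that the merge creates one row per (value, range) pair, so it can use a lot of memory when a code has many ranges.

```python
import pandas as pd

first_df = pd.DataFrame({'code1': [1, 1, 2, 2, 3, 1, 1],
                         'code2': [10, 22, 15, 15, 7, 130, 2]})
second_df = pd.DataFrame({'code1': [1, 1, 2, 2, 3, 1, 1],
                          'code2_start': [5, 20, 11, 11, 5, 110, 220],
                          'code2_end': [15, 25, 20, 20, 10, 120, 230]})

def in_any_range(values, ranges):
    # Hypothetical helper: keep the original row label as a column
    # so it survives the merge.
    left = values.rename_axis('row').reset_index()
    # Pair every row of `values` with every range sharing its code1.
    # how='left' keeps rows that have no matching range at all.
    pairs = left.merge(ranges, on='code1', how='left')
    # between() is inclusive on both ends by default;
    # comparisons against missing ranges (NaN) evaluate to False.
    hit = pairs['code2'].between(pairs['code2_start'], pairs['code2_end'])
    # A row is True if at least one of its paired ranges contains code2.
    return hit.groupby(pairs['row']).any().reindex(values.index).to_numpy()

first_df['output'] = in_any_range(first_df, second_df)
print(first_df['output'].tolist())
# [True, True, True, True, True, False, False]
```

On the question's data this gives True for rows 0 to 4 and False for rows 5 and 6, which matches the expected results described in the question.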


5 Comments

Val, thank you for coming back, but I should have made it clearer that first_df and second_df have different lengths. Moreover, first_df.code1 determines the intervals I need to check first_df.code2 against. For example, for a given code1 I may have 4 different intervals in second_df, and I need to check code2 against all of them. I've tried to use the boolean operators but I'm struggling to find a solution. Thx. A.
I think I understand what you mean, but could you provide an example of the desired output for this operation in the original question?
Thank you Val, I've just updated the above question with a more practical example.
I updated my answer with a solution to your problem.
Thank you, extremely clear and it works well. On a dataframe of more than 1,000,000 rows it may take a while, but it is not too bad!
