1

For every row in df_a, I am looking to find rows in df_b where the id's are the same and the df_a row's location falls within the df_b row's start and end location.

df_a looks like:

|---------------------|------------------|------------------|
|      Name           |     id           |     location     |
|---------------------|------------------|------------------|
|          a          |         1        |       202013     |
|---------------------|------------------|------------------|

df_b looks like:

|---------------------|------------------|------------------|------------------|
|      Name           |     id           |  location_start  |  location_end    |
|---------------------|------------------|------------------|------------------|
|          x          |         1        |       202010     |       2020199    |
|---------------------|------------------|------------------|------------------|

Unfortunately, df_a and df_b are both nearly a million rows. This code is taking like 10 hours to run on my local. Currently I'm running the following:

for index,row in df_a.iterrows():
         matched = df_b[(df_b['location_start']<row['location']) 
                  & (df_b['location_end']>row['location']) 
                  & (df_b['id']==row['id'])]

Is there any obvious way to speed this up?

1
  • kindly share more data for both df_a and b ... allows for better test of code Commented Apr 23, 2020 at 12:23

1 Answer 1

1

You can do this:

Consider my sample dataframes below:

In [90]: df_a = pd.DataFrame({'Name':['a','b'], 'id':[1,2], 'location':[202013, 102013]})                                                                                                                   

In [91]: df_b = pd.DataFrame({'Name':['a','b'], 'id':[1,2], 'location_start':[202010, 1020199],'location_end':[2020199, 1020299] })

In [92]: df_a                                                                                                                                                                                               
Out[92]: 
  Name  id  location
0    a   1    202013
1    b   2    102013

In [93]: df_b                                                                                                                                                                                               
Out[93]: 
  Name  id  location_start  location_end
0    a   1          202010       2020199
1    b   2         1020199       1020299

In [95]: d = pd.merge(df_a, df_b, on='id')

In [106]: indexes = d[d['location'].between(d['location_start'], d['location_end'])].index.tolist()                                                                                                         

In [107]: df_b.iloc[indexes, :]                                                                                                                                                                             
Out[107]: 
  Name  id  location_start  location_end
0    a   1          202010       2020199
Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.