filtering rows in one dataframe based on two columns of another dataframe

Question

I have two data frames. One dataframe (dfA) looks like:

Name    gender     start_coordinate    end_coordinate    ID      
Peter     M             30                  150           1      
Hugo      M            4500                6000           2      
Jennie    F             300                 700           3

The other dataframe (dfB) looks like

Name        position      string      
Peter         89            aa      
Jennie        568           bb     
Jennie        90            cc

I want to filter dfA so that only rows remain for which some position in dfB falls within the interval given by start_coordinate and end_coordinate, and the names match as well. For example, the position in row 1 of dfB falls within the interval of row 1 of dfA and the name is the same, so I want this row. In contrast, the position in row 3 of dfB also falls within the interval of row 1 of dfA, but the name is different, so I don't want this record.

The expected output therefore becomes:

##new_dfA
Name    gender     start_coordinate    end_coordinate    ID      
Peter     M             30                  150           1           
Jennie    F             300                 700           3 

##new_dfB
Name        position      string      
Peter         89            aa      
Jennie        568           bb

In reality, dfB has shape (443068765, 10) and dfA has shape (100000, 3), so I don't want to use numpy broadcasting because I run into a memory error. Is there a way to deal with this problem within the pandas framework? Any insights will be appreciated.

mozway · Accepted Answer · 2021-09-17 15:07:14Z

2

If you have that many rows, pandas might not be well suited for your application.

That said, if there aren't many rows with identical "Name", you could merge on "Name" and then filter the rows matching your condition:

dfC = dfA.merge(dfB, on='Name')
dfC = dfC[dfC['position'].between(dfC['start_coordinate'], dfC['end_coordinate'])]
dfA_new = dfC[dfA.columns]
dfB_new = dfC[dfB.columns]

output:

>>> dfA_new
     Name gender  start_coordinate  end_coordinate  ID
0   Peter      M                30             150   1
1  Jennie      F               300             700   3
>>> dfB_new
     Name  position string
0   Peter        89     aa
1  Jennie       568     bb
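If many rows share the same "Name" (for example, only a few dozen distinct names across hundreds of millions of rows), the intermediate merge above can become far larger than either input. One possible alternative is pd.merge_asof, which attaches to each dfB row at most one dfA row instead of every row with the same name. The following is a minimal sketch, assuming that the intervals belonging to one name do not overlap; if they do overlap, only the interval with the closest start at or below the position is checked, so some matches would be missed. The helper name match_intervals is hypothetical.

```python
import pandas as pd

dfA = pd.DataFrame({
    "Name": ["Peter", "Hugo", "Jennie"],
    "gender": ["M", "M", "F"],
    "start_coordinate": [30, 4500, 300],
    "end_coordinate": [150, 6000, 700],
    "ID": [1, 2, 3],
})
dfB = pd.DataFrame({
    "Name": ["Peter", "Jennie", "Jennie"],
    "position": [89, 568, 90],
    "string": ["aa", "bb", "cc"],
})


def match_intervals(dfA, dfB):
    """Hypothetical helper. Assumes non-overlapping intervals per Name."""
    # merge_asof requires both frames to be sorted by their "on" key.
    left = dfB.sort_values("position")
    right = dfA.sort_values("start_coordinate")
    # For each dfB row, take the dfA row with the same Name whose
    # start_coordinate is the largest one that is <= position.
    matched = pd.merge_asof(
        left, right,
        left_on="position", right_on="start_coordinate",
        by="Name", direction="backward",
    )
    # Keep only rows whose position also lies at or below the interval end.
    # Rows without a candidate interval hold NaN here and are dropped.
    matched = matched[matched["position"] <= matched["end_coordinate"]]
    new_dfB = matched[dfB.columns].reset_index(drop=True)
    # Several dfB rows may hit the same interval, so deduplicate, then
    # restore the integer dtypes that unmatched rows turned into floats.
    new_dfA = (matched[dfA.columns]
               .drop_duplicates()
               .astype(dfA.dtypes.to_dict())
               .reset_index(drop=True))
    return new_dfA, new_dfB


new_dfA, new_dfB = match_intervals(dfA, dfB)
print(new_dfA)
print(new_dfB)
```

Because each dfB row gains at most one partner, the intermediate result never has more rows than dfB itself. If dfB does not fit in memory at once, the same function could in principle be applied to dfB in pieces and the results concatenated.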


9 Comments

Odhian Over a year ago

I agree with @mozway, would be better if you had also an ID on dfB so you could merge on it.

ashkangh Over a year ago

Sorry! I didn't see your answer, but it seems our answers are exactly the same :)

John Over a year ago

@Mozway I have many rows with identical names. In total I have 25 names.

ashkangh Over a year ago

No, but it was interesting how similar our answers were :)

John Over a year ago

@mozway, how do you suggest I solve this problem if pandas might not be the most optimal way forward?


G.G · Accepted Answer · 2022-12-16 03:15:44Z

0

Use pandasql:

from pandasql import sqldf

sqldf("select dfA.* from dfA inner join dfB on dfB.Name = dfA.Name and dfB.position between dfA.start_coordinate and dfA.end_coordinate", globals())

     Name gender  start_coordinate  end_coordinate  ID
0   Peter      M                30             150   1
1  Jennie      F               300             700   3


sqldf("select dfB.* from dfA inner join dfB on dfB.Name = dfA.Name and dfB.position between dfA.start_coordinate and dfA.end_coordinate", globals())

     Name  position string
0   Peter        89     aa
1  Jennie       568     bb
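The same join can also be run without pandasql by loading both frames into SQLite through the standard-library sqlite3 module. This is only a sketch under a few assumptions: the table names, the index name idx_dfA_name_start, and the chunksize value are illustrative choices, and the in-memory database could be swapped for a file path if the tables should live on disk rather than in RAM.

```python
import sqlite3

import pandas as pd

dfA = pd.DataFrame({
    "Name": ["Peter", "Hugo", "Jennie"],
    "gender": ["M", "M", "F"],
    "start_coordinate": [30, 4500, 300],
    "end_coordinate": [150, 6000, 700],
    "ID": [1, 2, 3],
})
dfB = pd.DataFrame({
    "Name": ["Peter", "Jennie", "Jennie"],
    "position": [89, 568, 90],
    "string": ["aa", "bb", "cc"],
})

# ":memory:" keeps everything in RAM; a file path would store it on disk.
conn = sqlite3.connect(":memory:")

# Copy both frames into SQLite tables; chunksize writes dfB in batches.
dfA.to_sql("dfA", conn, index=False)
dfB.to_sql("dfB", conn, index=False, chunksize=100_000)

# An index on the join columns lets SQLite look up intervals by name.
conn.execute(
    "CREATE INDEX idx_dfA_name_start ON dfA (Name, start_coordinate)"
)

query = """
SELECT dfB.*
FROM dfB
JOIN dfA
  ON dfA.Name = dfB.Name
 AND dfB.position BETWEEN dfA.start_coordinate AND dfA.end_coordinate
"""
new_dfB = pd.read_sql_query(query, conn)
conn.close()

print(new_dfB)
```

Selecting dfA.* instead of dfB.* in the query gives the filtered dfA; adding DISTINCT would avoid repeated dfA rows when several positions fall in the same interval.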
