
I have two data frames. One dataframe (dfA) looks like:

Name    gender     start_coordinate    end_coordinate    ID      
Peter     M             30                  150           1      
Hugo      M            4500                6000           2      
Jennie    F             300                 700           3   

The other dataframe (dfB) looks like

Name        position      string      
Peter         89            aa      
Jennie        568           bb     
Jennie        90            cc

I want to filter dfA so that it keeps only the rows where a position from dfB falls inside the dfA interval (start_coordinate to end_coordinate) and the names match. For example, the position in row 1 of dfB falls in the interval given by row 1 of dfA and the names match, so I want this row. In contrast, the position in row 3 of dfB also falls in the interval of row 1 of dfA, but the names differ, so I don't want this record.

The expected output therefore becomes:

##new_dfA
Name    gender     start_coordinate    end_coordinate    ID      
Peter     M             30                  150           1           
Jennie    F             300                 700           3 

##new_dfB
Name        position      string      
Peter         89            aa      
Jennie        568           bb     

In reality, dfB has shape (443068765, 10) and dfA has shape (100000, 3), so I don't want to use numpy broadcasting because I run into a memory error. Is there a way to deal with this problem within the pandas framework? Any insights will be appreciated.

2 Answers


If you have that many rows, pandas might not be well suited for your application.

That said, if there aren't many rows with identical "Name", you could merge on "Name" and then filter the rows matching your condition:

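# pair every dfA row with every dfB row that shares its Name
dfC = dfA.merge(dfB, on='Name')
# keep only the pairs whose position lies inside the interval (bounds inclusive)
dfC = dfC[dfC['position'].between(dfC['start_coordinate'], dfC['end_coordinate'])]
# split the result back into the columns of each original frame
dfA_new = dfC[dfA.columns]
dfB_new = dfC[dfB.columns]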

output:

>>> dfA_new
     Name gender  start_coordinate  end_coordinate  ID
0   Peter      M                30             150   1
1  Jennie      F               300             700   3
>>> dfB_new
     Name  position string
0   Peter        89     aa
1  Jennie       568     bb
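
If the full merge does not fit in memory, one possible workaround is to run the same merge-and-filter on slices of dfB and collect only the hits. The following is a minimal sketch, not a tuned solution: `filter_in_chunks`, `chunk_size` and the helper column `_row` are hypothetical names introduced here, and the sketch assumes that `ID` is unique in dfA.

```python
import pandas as pd

def filter_in_chunks(dfA, dfB, chunk_size=1_000_000):
    """Merge dfB into dfA one slice at a time and keep only in-interval rows."""
    kept_rows = []       # positional row numbers of matching dfB rows
    matched_ids = set()  # IDs of dfA rows that had at least one match
    for start in range(0, len(dfB), chunk_size):
        chunk = dfB.iloc[start:start + chunk_size]
        # remember where each row came from so duplicates can be collapsed
        chunk = chunk.assign(_row=range(start, start + len(chunk)))
        merged = chunk.merge(dfA, on="Name")
        hit = merged["position"].between(
            merged["start_coordinate"], merged["end_coordinate"]
        )
        merged = merged[hit]
        kept_rows.extend(merged["_row"].unique().tolist())
        matched_ids.update(merged["ID"].tolist())
    dfB_new = dfB.iloc[kept_rows].reset_index(drop=True)
    dfA_new = dfA[dfA["ID"].isin(list(matched_ids))].reset_index(drop=True)
    return dfA_new, dfB_new

# sample data from the question
dfA = pd.DataFrame({
    "Name": ["Peter", "Hugo", "Jennie"],
    "gender": ["M", "M", "F"],
    "start_coordinate": [30, 4500, 300],
    "end_coordinate": [150, 6000, 700],
    "ID": [1, 2, 3],
})
dfB = pd.DataFrame({
    "Name": ["Peter", "Jennie", "Jennie"],
    "position": [89, 568, 90],
    "string": ["aa", "bb", "cc"],
})

dfA_new, dfB_new = filter_in_chunks(dfA, dfB, chunk_size=2)
```

Peak memory then depends on the size of one merged chunk rather than on the whole of dfB, at the cost of repeating the merge once per slice. With only a few distinct names, each chunk still pairs with many dfA rows, so `chunk_size` may need to be fairly small.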

9 Comments

I agree with @mozway, would be better if you had also an ID on dfB so you could merge on it.
Sorry! I didn't see your answer, but it seems our answers are exactly the same :)
@Mozway I have many rows with identical names. In total I have 25 names.
No, but it was interesting how similar our answers were :)
@mozway, how do you suggest I solve this problem if pandas might not be the most optimal way forward?

Use pandasql:

from pandasql import sqldf

sqldf("select dfA.* from dfA inner join dfB on dfB.Name = dfA.Name and dfB.position between dfA.start_coordinate and dfA.end_coordinate", globals())

     Name gender  start_coordinate  end_coordinate  ID
0   Peter      M                30             150   1
1  Jennie      F               300             700   3


sqldf("select dfB.* from dfA inner join dfB on dfB.Name = dfA.Name and dfB.position between dfA.start_coordinate and dfA.end_coordinate", globals())

     Name  position string
0   Peter        89     aa
1  Jennie       568     bb
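
pandasql runs its queries through SQLite by default, so the same join can also be written directly against the standard-library `sqlite3` module. The following is a rough sketch under that assumption, using the sample data from the question; the `DISTINCT` and `ORDER BY` clauses are additions made here so that a dfA row matched by several positions appears only once and the row order is predictable.

```python
import sqlite3
import pandas as pd

# sample data from the question
dfA = pd.DataFrame({
    "Name": ["Peter", "Hugo", "Jennie"],
    "gender": ["M", "M", "F"],
    "start_coordinate": [30, 4500, 300],
    "end_coordinate": [150, 6000, 700],
    "ID": [1, 2, 3],
})
dfB = pd.DataFrame({
    "Name": ["Peter", "Jennie", "Jennie"],
    "position": [89, 568, 90],
    "string": ["aa", "bb", "cc"],
})

# copy both frames into an in-memory SQLite database
con = sqlite3.connect(":memory:")
dfA.to_sql("dfA", con, index=False)
dfB.to_sql("dfB", con, index=False)

join_condition = """
FROM dfA
INNER JOIN dfB
   ON dfB.Name = dfA.Name
  AND dfB.position BETWEEN dfA.start_coordinate AND dfA.end_coordinate
"""

# dfA rows that contain at least one matching position
new_dfA = pd.read_sql_query(
    "SELECT DISTINCT dfA.* " + join_condition + " ORDER BY dfA.ID", con
)
# dfB rows that fall inside an interval of the same name
new_dfB = pd.read_sql_query(
    "SELECT DISTINCT dfB.* " + join_condition + " ORDER BY dfB.position", con
)
con.close()
```

Connecting to a file path instead of `":memory:"` would keep the tables on disk, which may matter at the sizes mentioned in the question.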
