Select rows in one DataFrame based on rows in another

Question

Let's assume I have a very large pandas DataFrame dfBig with columns Param1, Param2, ..., ParamN, score, step, and a smaller DataFrame dfSmall with columns Param1, Param2, ..., ParamN (i.e. missing the score and step columns).

I want to select all the rows of dfBig for which the values of columns Param1, Param2, ..., ParamN match those of some row in dfSmall. Is there a clean way of doing this in pandas?

Edit: To give an example, consider this DataFrame dfBig:

Arch  | Layers | Score | Time
A     | 1      | 0.3   | 10
A     | 1      | 0.6   | 20
A     | 1      | 0.7   | 30
A     | 2      | 0.4   | 10
A     | 2      | 0.5   | 20
A     | 2      | 0.6   | 30
B     | 1      | 0.1   | 10
B     | 1      | 0.2   | 20
B     | 1      | 0.7   | 30
B     | 2      | 0.7   | 10
B     | 2      | 0.8   | 20
B     | 2      | 0.8   | 30

Let's imagine a model is specified by a pair (Arch, Layers). I want to query dfBig and get the time series for scores over time for the best performing models with Arch A and Arch B.

Following EdChum's answer below, I take it that the best solution is to do something like this procedurally:

modelColumns = [col for col in dfBigCol if col not in ["Time", "Score"]]    
groupedBest = dfBig.groupby("Arch").Score.max()
dfSmall = pd.DataFrame(groupedBest).reset_index()[modelColumns]
dfBest = pd.merge(dfSmall, dfBig)

which yields:

Arch  | Layers | Score | Time
A     | 1      | 0.3   | 10
A     | 1      | 0.6   | 20
A     | 1      | 0.7   | 30
B     | 2      | 0.7   | 10
B     | 2      | 0.8   | 20
B     | 2      | 0.8   | 30

If there's a better way to do this, I'm happy to hear it.

It'd be helpful if you post a small sample of what you want to achieve and also define what match means — EdChum
– EdChum, Commented May 12, 2015 at 10:44
EdChum: I've added some details in the post. Thanks for your help thus far! — Edward Grefenstette
– Edward Grefenstette, Commented May 14, 2015 at 13:07

EdChum · Accepted Answer · 2015-05-12 10:42:47Z

1

If I understand your question correctly you should be able to just call merge on dfBig and pass dfSmall which will look for matches in the aligned columns and only return those rows.

Example:

In [71]:

dfBig = pd.DataFrame({'a':np.arange(100), 'b':np.arange(100), 'c':np.arange(100)})
dfSmall = pd.DataFrame({'a':[3,4,5,6]})
dfBig.merge(dfSmall)
Out[71]:
   a  b  c
0  3  3  3
1  4  4  4
2  5  5  5
3  6  6  6

answered May 12, 2015 at 10:42

EdChum

397k204 gold badges836 silver badges583 bronze badges

Sign up to request clarification or add additional context in comments.

1 Comment

Edward Grefenstette Over a year ago

Yes, I'm an idiot. A few minutes after posting this I thought "this sounds like an SQL merge" and google pandas merge. Thanks for the answer!

Collectives™ on Stack Overflow

Select rows in one DataFrame based on rows in another

1 Answer 1

1 Comment

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

1 Comment

Your Answer

Sign up or log in

Post as a guest

Related