Let's assume I have a very large pandas DataFrame dfBig with columns Param1, Param2, ..., ParamN, score, step, and a smaller DataFrame dfSmall with columns Param1, Param2, ..., ParamN (i.e. missing the score and step columns).
I want to select all the rows of dfBig for which the values of columns Param1, Param2, ..., ParamN match those of some row in dfSmall. Is there a clean way of doing this in pandas?
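The straightforward way to express this kind of match is an inner join on the shared parameter columns. A minimal sketch with toy data (the column names and values here are placeholders, not from the real frames):

```python
import pandas as pd

# Toy stand-ins: dfBig has the parameter columns plus score/step,
# dfSmall has only the parameter columns.
dfBig = pd.DataFrame({
    "Param1": ["a", "a", "b", "b"],
    "Param2": [1, 2, 1, 2],
    "score":  [0.1, 0.2, 0.3, 0.4],
    "step":   [10, 20, 30, 40],
})
dfSmall = pd.DataFrame({"Param1": ["a", "b"], "Param2": [2, 1]})

# An inner merge on the shared columns keeps exactly those rows of
# dfBig whose (Param1, ..., ParamN) tuple appears in some row of dfSmall.
paramCols = list(dfSmall.columns)
matched = pd.merge(dfSmall, dfBig, on=paramCols)
print(matched)
```

By default `pd.merge` joins on all columns common to both frames, so `on=paramCols` is optional here; spelling it out just makes the key explicit.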
Edit: To give an example, consider this DataFrame dfBig:
Arch | Layers | Score | Time
A | 1 | 0.3 | 10
A | 1 | 0.6 | 20
A | 1 | 0.7 | 30
A | 2 | 0.4 | 10
A | 2 | 0.5 | 20
A | 2 | 0.6 | 30
B | 1 | 0.1 | 10
B | 1 | 0.2 | 20
B | 1 | 0.7 | 30
B | 2 | 0.7 | 10
B | 2 | 0.8 | 20
B | 2 | 0.8 | 30
Let's imagine a model is specified by a pair (Arch, Layers). I want to query dfBig and get the time series for scores over time for the best performing models with Arch A and Arch B.
Following EdChum's answer below, I take it the best approach is something like this:
modelColumns = [col for col in dfBig.columns if col not in ["Time", "Score"]]
bestRows = dfBig.groupby("Arch").Score.idxmax()
dfSmall = dfBig.loc[bestRows, modelColumns]
dfBest = pd.merge(dfSmall, dfBig)
which yields:
Arch | Layers | Score | Time
A | 1 | 0.3 | 10
A | 1 | 0.6 | 20
A | 1 | 0.7 | 30
B | 2 | 0.7 | 10
B | 2 | 0.8 | 20
B | 2 | 0.8 | 30
If there's a better way to do this, I'm happy to hear it.
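One alternative sketch that avoids the merge entirely: build the model keys as a MultiIndex and filter dfBig with a membership mask. This reproduces the example above (assuming, as in the merge version, that `idxmax` is an acceptable way to pick the best row per Arch; ties go to the first occurrence):

```python
import pandas as pd

# The example dfBig from the question.
dfBig = pd.DataFrame({
    "Arch":   ["A"] * 6 + ["B"] * 6,
    "Layers": [1, 1, 1, 2, 2, 2] * 2,
    "Score":  [0.3, 0.6, 0.7, 0.4, 0.5, 0.6,
               0.1, 0.2, 0.7, 0.7, 0.8, 0.8],
    "Time":   [10, 20, 30] * 4,
})

modelColumns = ["Arch", "Layers"]

# The (Arch, Layers) key of the best-scoring row per Arch.
dfSmall = dfBig.loc[dfBig.groupby("Arch").Score.idxmax(), modelColumns]

# Membership test on a MultiIndex instead of a merge: keep the rows
# of dfBig whose model key appears in dfSmall.
mask = dfBig.set_index(modelColumns).index.isin(
    dfSmall.set_index(modelColumns).index)
dfBest = dfBig[mask]
print(dfBest)
```

Unlike the merge, this preserves dfBig's original index and row order, which can matter if you want to keep track of where the matching rows came from.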