8

I have a dataframe like this:

Date                PlumeO      Distance
2014-08-13 13:48:00  754.447905 5.844577 
2014-08-13 13:48:00  754.447905 6.888653
2014-08-13 13:48:00  754.447905 6.938860
2014-08-13 13:48:00  754.447905 6.977284
2014-08-13 13:48:00  754.447905 6.946430 
2014-08-13 13:48:00  754.447905 6.345506
2014-08-13 13:48:00  754.447905 6.133567
2014-08-13 13:48:00  754.447905 5.846046 
2014-08-13 16:59:00  754.447905 6.345506 
2014-08-13 16:59:00  754.447905 6.694847 
2014-08-13 16:59:00  754.447905 5.846046 
2014-08-13 16:59:00  754.447905 6.977284 
2014-08-13 16:59:00  754.447905 6.938860 
2014-08-13 16:59:00  754.447905 5.844577 
2014-08-13 16:59:00  754.447905 6.888653 
2014-08-13 16:59:00  754.447905 6.133567 
2014-08-13 16:59:00  754.447905 6.946430

For each date, I want to keep only the row with the smallest distance; in other words, drop the duplicate dates and keep the one with the smallest distance.

Is there a way to achieve this in pandas' df.drop_duplicates or am I stuck using if statements to find the smallest distance?

3 Answers

13

Sort by distance, then drop the duplicate dates:

df.sort_values('Distance').drop_duplicates(subset='Date', keep='first')
Out: 
                   Date      PlumeO  Distance
0   2014-08-13 13:48:00  754.447905  5.844577
13  2014-08-13 16:59:00  754.447905  5.844577
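
For anyone who wants to reproduce this, here is a minimal sketch that rebuilds the frame from the values in the question and applies the same sort-then-drop approach. The variable name `closest` and the trailing `sort_index()` (which only restores the original row order) are my own additions, not part of the answer above.

```python
import pandas as pd

# Rebuild the frame from the question: 8 rows at 13:48, 9 rows at 16:59.
dates = ["2014-08-13 13:48:00"] * 8 + ["2014-08-13 16:59:00"] * 9
distances = [
    5.844577, 6.888653, 6.938860, 6.977284, 6.946430, 6.345506, 6.133567, 5.846046,
    6.345506, 6.694847, 5.846046, 6.977284, 6.938860, 5.844577, 6.888653, 6.133567,
    6.946430,
]
df = pd.DataFrame(
    {"Date": pd.to_datetime(dates), "PlumeO": 754.447905, "Distance": distances}
)

# Sort so the smallest distance comes first, keep the first row seen per date,
# then put the surviving rows back in their original order.
closest = (
    df.sort_values("Distance")
    .drop_duplicates(subset="Date", keep="first")
    .sort_index()
)
print(closest)
```

If two rows for the same date tie on the minimum distance, which of them is kept depends on the sort order; passing `kind="stable"` to `sort_values` should keep the earlier row in that case.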

1 Comment

Despite having to sort, this answer is also really fast (-:
7

The advantage of these approaches is that they do not require a sort.

Option 1
Within a groupby, idxmin returns the index label of the minimum value in each group. Use those labels to select the matching rows from your dataframe.

df.loc[df.groupby('Date').Distance.idxmin()]

                   Date      PlumeO  Distance
0   2014-08-13 13:48:00  754.447905  5.844577
13  2014-08-13 16:59:00  754.447905  5.844577
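
To make the two steps visible, here is a small sketch on a cut-down frame. The six rows are a hypothetical subset chosen for illustration, and `idx` and `closest` are names I made up. It assumes the index labels are unique, since `df.loc` selects by label.

```python
import pandas as pd

# Hypothetical six-row subset, three rows per date.
df = pd.DataFrame(
    {
        "Date": pd.to_datetime(
            ["2014-08-13 13:48:00"] * 3 + ["2014-08-13 16:59:00"] * 3
        ),
        "Distance": [5.844577, 6.888653, 6.938860, 6.345506, 5.844577, 6.694847],
    }
)

# Step 1: one index label per date, pointing at the row with the minimum distance.
idx = df.groupby("Date")["Distance"].idxmin()

# Step 2: select those rows by label.
closest = df.loc[idx]
print(idx)
print(closest)
```

If the index may contain duplicate labels, calling `df.reset_index(drop=True)` first is probably the safest way to make the label lookup unambiguous.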

Option 2
You can use pd.DataFrame.nsmallest to return the rows associated with the smallest distance.

df.groupby('Date', group_keys=False).apply(
    pd.DataFrame.nsmallest, n=1, columns='Distance'
)

                   Date      PlumeO  Distance
0   2014-08-13 13:48:00  754.447905  5.844577
13  2014-08-13 16:59:00  754.447905  5.844577
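
A related variant, not the code above, is to call `nsmallest` on the grouped column instead of going through `apply`. This is only a sketch on a hypothetical six-row subset, and `smallest` and `closest` are illustrative names. As far as I know, newer pandas releases change how `groupby(...).apply` treats the grouping column, so this form may be less sensitive to the pandas version.

```python
import pandas as pd

# Hypothetical six-row subset, three rows per date.
df = pd.DataFrame(
    {
        "Date": pd.to_datetime(
            ["2014-08-13 13:48:00"] * 3 + ["2014-08-13 16:59:00"] * 3
        ),
        "Distance": [5.844577, 6.888653, 6.938860, 6.345506, 5.844577, 6.694847],
    }
)

# Smallest distance per date; the result is a Series indexed by
# (Date, original row label).
smallest = df.groupby("Date")["Distance"].nsmallest(1)

# Use the original row labels (last index level) to pull the full rows.
closest = df.loc[smallest.index.get_level_values(-1)]
print(closest)
```

Changing `nsmallest(1)` to `nsmallest(2)` would keep the two closest rows per date, which the sort-and-drop approach cannot do directly.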

1

I would say sort the data first and then drop the duplicate dates:

stripped_data = df.sort_values('Distance').drop_duplicates('Date', keep='first')
