5

I've answered this question several times in the guise of different contexts and I realized that there isn't a good canonical approach specified anywhere.

So, to set up a simple problem:

Problem

df = pd.DataFrame(dict(A=range(6), B=[1, 2] * 3))
print(df)

   A  B
0  0  1
1  1  2
2  2  1
3  3  2
4  4  1
5  5  2

Question:

How do I sort by the product of columns 'A' and 'B'?
Here is an approach where I add a temporary column to the dataframe, use it to sort_values then drop it.

df.assign(P=df.prod(1)).sort_values('P').drop('P', 1)

   A  B
0  0  1
1  1  2
2  2  1
4  4  1
3  3  2
5  5  2

Is there a better, more concise, clearer, more consistent approach?

1 Answer 1

9

TL;DR
iloc + argsort


We can approach this using iloc where we can take an array of ordinal positions and return the dataframe reordered by these positions.

With the power of iloc, we can sort with any array that specifies the order.

Now, all we need to do is identify a method for getting this ordering. Turns out there is a method called argsort which does exactly this. By passing the results of argsort to iloc, we can get our dataframe sorted out.

Example 1

Using the specified problem above

df.iloc[df.prod(1).argsort()]

Same results as above

   A  B
0  0  1
1  1  2
2  2  1
4  4  1
3  3  2
5  5  2

That was for simplicity. We could take this further if performance is an issue and focus on numpy

v = df.values
a = v.prod(1).argsort()
pd.DataFrame(v[a], df.index[a], df.columns)

How fast are these solutions?

enter image description here

We can see that pd_ext_sort is the most concise but does not scale as well as the others.
np_ext_sort gives the best performance at the expense of transparency. Though, I'd argue that it's still very clear what is going on.

backtest setup

def add_drop():
    return df.assign(P=df.prod(1)).sort_values('P').drop('P', 1)

def pd_ext_sort():
    return df.iloc[df.prod(1).argsort()]

def np_ext_sort():
    v = df.values
    a = v.prod(1).argsort()
    return pd.DataFrame(v[a], df.index[a], df.columns)

results = pd.DataFrame(
    index=pd.Index([10, 100, 1000, 10000], name='Size'),
    columns=pd.Index(['add_drop', 'pd_ext_sort', 'np_ext_sort'], name='method')
)

for i in results.index:
    df = pd.DataFrame(np.random.rand(i, 2), columns=['A', 'B'])
    for j in results.columns:
        stmt = '{}()'.format(j)
        setup = 'from __main__ import df, {}'.format(j)
        results.set_value(i, j, timeit(stmt, setup, number=100))

results.plot()

Example 2

Suppose I have a column of negative and positive values. I want to sort by increasing magnitude... however, I want the negatives to come first.

Suppose I have dataframe df

df = pd.DataFrame(dict(A=range(-2, 3)))
print(df)

   A
0 -2
1 -1
2  0
3  1
4  2

I'll set up 3 versions again. This time I'll use np.lexsort which returns the same type of array as argsort. Meaning, I can use it to reorder the dataframe.

Caveat: np.lexsort sorts by the last array in its list first. \shurg

def add_drop():
    return df.assign(P=df.A >= 0, M=df.A.abs()).sort_values(['P', 'M']).drop(['P', 'M'], 1)

def pd_ext_sort():
    v = df.A.values
    return df.iloc[np.lexsort([np.abs(v), v >= 0])]

def np_ext_sort():
    v = df.A.values
    a = np.lexsort([np.abs(v), v >= 0])
    return pd.DataFrame(v[a, None], df.index[a], df.columns)

All of which return

   A
1 -1
0 -2
2  0
3  1
4  2

How fast this time?

enter image description here

In this example, both pd_ext_sort and np_ext_sort outperformed add_drop.

backtest setup

results = pd.DataFrame(
    index=pd.Index([10, 100, 1000, 10000], name='Size'),
    columns=pd.Index(['add_drop', 'pd_ext_sort', 'np_ext_sort'], name='method')
)

for i in results.index:
    df = pd.DataFrame(np.random.randn(i, 1), columns=['A'])
    for j in results.columns:
        stmt = '{}()'.format(j)
        setup = 'from __main__ import df, {}'.format(j)
        results.set_value(i, j, timeit(stmt, setup, number=100))

results.plot(figsize=(15, 6))
Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.