Compare elements in a NumPy array 3 rows at a time

Question

I have a NumPy array like this:

[[3.4, 87]
 [5.5, 11]
 [22, 3]
 [4, 9.8]
 [41, 11.22]
 [32, 7.6]]

and I want to:

compare the elements in column 2, 3 rows at a time
delete the row with the largest value in column 2 from each group of 3 rows

For example, in the first 3 rows the values in column 2 are 87, 11 and 3, and I would like to keep the rows containing 11 and 3.

The output array I expect is:

[[5.5, 11]
 [22, 3]
 [4, 9.8]
 [32, 7.6]]

I am new to NumPy arrays, so any advice on how to achieve this would be appreciated.

unutbu · Accepted Answer · 2016-10-30 13:45:09Z

import numpy as np
x = np.array([[3.4, 87],
              [5.5, 11],
              [22, 3],
              [4, 9.8],
              [41, 11.22],
              [32, 7.6]])

y = x.reshape(-1,3,2)
idx = y[..., 1].argmax(axis=1)
mask = np.arange(3)[None, :] != idx[:, None]
y = y[mask]
print(y)
# This might be helpful for the deleted part of your question
# y = y.reshape(-1,2,2)
# z = y[...,1]/y[...,1].sum(axis=1)
# result = np.dstack([y, z[...,None]])

yields

[[  5.5  11. ]
 [ 22.    3. ]
 [  4.    9.8]
 [ 32.    7.6]]

"Grouping by three" with NumPy can be done by reshaping the array to create a new axis of length 3 -- provided the original number of rows is divisible by 3:

In [92]: y = x.reshape(-1,3,2); y
Out[92]: 
array([[[  3.4 ,  87.  ],
        [  5.5 ,  11.  ],
        [ 22.  ,   3.  ]],

       [[  4.  ,   9.8 ],
        [ 41.  ,  11.22],
        [ 32.  ,   7.6 ]]])

In [93]: y.shape
Out[93]: (2, 3, 2)  
          |  |  |
          |  |  o--- 2 columns in each group
          |  o------ 3 rows in each group
          o--------- 2 groups
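The reshape only works when the row count is a multiple of 3. If it is not, one possible approach (an assumption on my part, since the question does not say what should happen to leftover rows) is to reshape only the rows that fill complete groups and keep the remainder separately. The helper name `split_full_groups` below is hypothetical:

```python
import numpy as np

def split_full_groups(x, group_size=3):
    """Hypothetical helper: split x into complete groups and leftover rows."""
    # Number of rows that fit into complete groups of `group_size`.
    n_full = (x.shape[0] // group_size) * group_size
    # Reshape only the complete part; a plain reshape of x would raise otherwise.
    grouped = x[:n_full].reshape(-1, group_size, x.shape[1])
    # Rows that do not fill a whole group are returned untouched.
    leftover = x[n_full:]
    return grouped, leftover

# 7 rows: two complete groups of 3 plus one leftover row.
x7 = np.arange(14, dtype=float).reshape(7, 2)
grouped, leftover = split_full_groups(x7)
print(grouped.shape)   # (2, 3, 2)
print(leftover.shape)  # (1, 2)
```

Whether the leftover rows should be kept as they are, dropped, or treated as a smaller group depends on your data.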

For each group, we can select the second column and find the row with the maximum value:

In [94]: idx = y[..., 1].argmax(axis=1); idx
Out[94]: array([0, 1])

array([0, 1]) indicates that in the first group, the 0th indexed row contains the maximum (i.e. 87), and in the second group, the 1st indexed row contains the maximum (i.e. 11.22).
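One detail worth knowing: if the maximum occurs more than once within a group, argmax returns the index of the first occurrence, so only that one row would be dropped. A small sketch with made-up data (the array `tied` is not from the question):

```python
import numpy as np

# One group of three rows in which the maximum of column 2 (9.0) appears twice.
tied = np.array([[[1.0, 9.0],
                  [2.0, 9.0],
                  [3.0, 4.0]]])

# argmax reports the first position at which the maximum occurs.
tie_idx = tied[..., 1].argmax(axis=1)
print(tie_idx)  # [0]
```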

Next, we can generate a 2D boolean selection mask which is True where the rows do not contain the maximum value:

In [95]: mask = np.arange(3)[None, :] != idx[:, None]; mask
Out[95]: 
array([[False,  True,  True],
       [ True, False,  True]], dtype=bool)

In [96]: mask.shape
Out[96]: (2, 3)

mask has shape (2,3). y has shape (2,3,2). If mask is used to index y as in y[mask], then the mask is aligned with the first two axes of y, and all values where mask is True are returned:

In [98]: y[mask]
Out[98]: 
array([[  5.5,  11. ],
       [ 22. ,   3. ],
       [  4. ,   9.8],
       [ 32. ,   7.6]])

In [99]: y[mask].shape
Out[99]: (4, 2)
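As a possible alternative to the boolean mask, the per-group indices can be converted into row numbers of the original 2D array and passed to np.delete. This is only a sketch of a different route to the same result, not a replacement for the approach above; the names `group_size` and `rows_to_drop` are mine:

```python
import numpy as np

x = np.array([[3.4, 87],
              [5.5, 11],
              [22, 3],
              [4, 9.8],
              [41, 11.22],
              [32, 7.6]])

group_size = 3
# Position of the maximum of column 2 within each group of three rows.
idx = x.reshape(-1, group_size, x.shape[1])[..., 1].argmax(axis=1)
# Convert within-group positions to row numbers in the original array:
# group g starts at row g * group_size.
rows_to_drop = idx + group_size * np.arange(len(idx))
# np.delete returns a new array without the listed rows.
result = np.delete(x, rows_to_drop, axis=0)
print(rows_to_drop)  # [0 4]
print(result)
```

This should give the same four rows as y[mask], without building the intermediate mask.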

By the way, the same calculation could be done using Pandas like this:

import numpy as np
import pandas as pd
x = np.array([[3.4, 87],
              [5.5, 11],
              [22, 3],
              [4, 9.8],
              [41, 11.22],
              [32, 7.6]])

df = pd.DataFrame(x)
idx = df.groupby(df.index // 3)[1].idxmax()
# drop the row with the maximum value in each group
df = df.drop(idx.values, axis=0)

which yields the DataFrame:

      0     1
1   5.5  11.0
2  22.0   3.0
3   4.0   9.8
5  32.0   7.6

You might find Pandas syntax easier to use, but for the above calculation NumPy is faster.

Thank you for the effective answer and detailed description, and I think I need time to fully understand them.
