I have a NumPy array like the one below:

[[3.4, 87]
 [5.5, 11]
 [22, 3]
 [4, 9.8]
 [41, 11.22]
 [32, 7.6]]

and I want to:

  1. compare the values in column 2, three rows at a time
  2. delete the row with the largest value in column 2 within each group of three rows

For example, in the first 3 rows the values in column 2 are 87, 11 and 3, and I would like to keep the rows containing 11 and 3.

The output array I expect is:

[[5.5, 11]
 [22, 3]
 [4, 9.8]
 [32, 7.6]]

I am new to NumPy arrays. How can I achieve this?

1 Answer

import numpy as np
x = np.array([[3.4, 87],
              [5.5, 11],
              [22, 3],
              [4, 9.8],
              [41, 11.22],
              [32, 7.6]])

y = x.reshape(-1,3,2)
idx = y[..., 1].argmax(axis=1)
mask = np.arange(3)[None, :] != idx[:, None]
y = y[mask]
print(y)
# This might be helpful for the deleted part of your question
# y = y.reshape(-1,2,2)
# z = y[...,1]/y[...,1].sum(axis=1, keepdims=True)
# result = np.dstack([y, z[...,None]])

yields

[[  5.5  11. ]
 [ 22.    3. ]
 [  4.    9.8]
 [ 32.    7.6]]
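
If the same operation is needed for other group sizes or columns, the snippet above could be wrapped in a small function. This is only a sketch: the name `drop_group_max` and its parameters are hypothetical, and it assumes a 2D array whose row count is a multiple of the group size.

```python
import numpy as np

def drop_group_max(x, group_size=3, col=1):
    """Drop, from each consecutive block of `group_size` rows, the row
    holding the largest value in column `col`.

    Hypothetical helper: raises ValueError if the rows do not split evenly.
    """
    x = np.asarray(x)
    n_rows, n_cols = x.shape
    if n_rows % group_size != 0:
        raise ValueError(
            f"{n_rows} rows cannot be split into groups of {group_size}")
    # New middle axis of length group_size: (n_groups, group_size, n_cols)
    y = x.reshape(-1, group_size, n_cols)
    # Position of the maximum of column `col` inside each group
    idx = y[..., col].argmax(axis=1)
    # True everywhere except at each group's maximum
    mask = np.arange(group_size)[None, :] != idx[:, None]
    return y[mask]

x = np.array([[3.4, 87],
              [5.5, 11],
              [22, 3],
              [4, 9.8],
              [41, 11.22],
              [32, 7.6]])
result = drop_group_max(x)
```

Here `result` holds the same four rows as the printed output above.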

"Grouping by three" with NumPy can be done by reshaping the array to create a new axis of length 3 -- provided the original number of rows is divisible by 3:

In [92]: y = x.reshape(-1,3,2); y
Out[92]: 
array([[[  3.4 ,  87.  ],
        [  5.5 ,  11.  ],
        [ 22.  ,   3.  ]],

       [[  4.  ,   9.8 ],
        [ 41.  ,  11.22],
        [ 32.  ,   7.6 ]]])

In [93]: y.shape
Out[93]: (2, 3, 2)  
          |  |  |
          |  |  o--- 2 columns in each group
          |  o------ 3 rows in each group
          o--------- 2 groups
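
To illustrate the divisibility requirement: when the number of rows is not a multiple of 3, the reshape raises a `ValueError`. The sketch below shows this, along with one possible workaround that groups only the complete blocks and sets the leftover rows aside. How the leftover rows should be treated is an assumption here, since the question does not cover that case.

```python
import numpy as np

x = np.arange(10, dtype=float).reshape(5, 2)  # 5 rows: not divisible by 3

try:
    x.reshape(-1, 3, 2)
    reshape_failed = False
except ValueError:
    # 10 elements cannot be arranged into blocks of 3 * 2 = 6 elements
    reshape_failed = True

# One possible workaround (an assumption): reshape only the complete groups
n_full = (len(x) // 3) * 3          # number of rows in complete groups
head, tail = x[:n_full], x[n_full:]  # tail holds the leftover rows
groups = head.reshape(-1, 3, 2)
```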

For each group, we can select the second column and find the row with the maximum value:

In [94]: idx = y[..., 1].argmax(axis=1); idx
Out[94]: array([0, 1])

array([0, 1]) indicates that in the first group the row at index 0 contains the maximum (i.e. 87), and in the second group the row at index 1 contains the maximum (i.e. 11.22).
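
One detail worth knowing: when a group contains the maximum more than once, `argmax` returns the index of the first occurrence, so only one row per group is dropped. A small sketch with made-up data (the array `tied` is not from the question):

```python
import numpy as np

# One group of three rows; the maximum 7.0 appears twice in column 1
tied = np.array([[[1.0, 7.0],
                  [2.0, 7.0],
                  [3.0, 5.0]]])

# argmax reports the first position where the maximum occurs
idx = tied[..., 1].argmax(axis=1)
```

In this case `idx` refers to row 0, so the second row with 7.0 would be kept.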

Next, we can generate a 2D boolean selection mask which is True where the rows do not contain the maximum value:

In [95]: mask = np.arange(3)[None, :] != idx[:, None]; mask
Out[95]: 
array([[False,  True,  True],
       [ True, False,  True]], dtype=bool)

In [96]: mask.shape
Out[96]: (2, 3)
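
The mask relies on broadcasting. As a sketch, the same comparison can be split into named steps to make the shapes visible (the names `positions` and `per_group` are introduced here for illustration only):

```python
import numpy as np

idx = np.array([0, 1])              # row of the maximum in each group

positions = np.arange(3)[None, :]   # shape (1, 3): row positions 0, 1, 2
per_group = idx[:, None]            # shape (2, 1): one index per group

# Broadcasting (1, 3) against (2, 1) compares every row position
# with every group's maximum index, giving shape (2, 3)
mask = positions != per_group
```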

mask has shape (2,3) and y has shape (2,3,2). When mask is used to index y, as in y[mask], it is aligned with the first two axes of y, and every row for which mask is True is returned, stacked into a 2D array:

In [98]: y[mask]
Out[98]: 
array([[  5.5,  11. ],
       [ 22. ,   3. ],
       [  4. ,   9.8],
       [ 32. ,   7.6]])

In [99]: y[mask].shape
Out[99]: (4, 2)
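
As an alternative to the boolean mask, the per-group indices can be converted to row numbers in the original array and passed to `np.delete`. This is a sketch of a different technique than the one used above, and the variable names are illustrative:

```python
import numpy as np

x = np.array([[3.4, 87],
              [5.5, 11],
              [22, 3],
              [4, 9.8],
              [41, 11.22],
              [32, 7.6]])

group_size = 3
# Index of the maximum of column 1 inside each group of three rows
idx = x.reshape(-1, group_size, 2)[..., 1].argmax(axis=1)

# Add each group's starting row (0, 3, ...) to get row numbers in x
rows_to_drop = idx + np.arange(len(idx)) * group_size

result = np.delete(x, rows_to_drop, axis=0)
```

For the example data this removes rows 0 and 4 and gives the same four rows as y[mask].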

By the way, the same calculation could be done using Pandas like this:

import numpy as np
import pandas as pd
x = np.array([[3.4, 87],
              [5.5, 11],
              [22, 3],
              [4, 9.8],
              [41, 11.22],
              [32, 7.6]])

df = pd.DataFrame(x)
idx = df.groupby(df.index // 3)[1].idxmax()
# drop the row with the maximum value in each group
df = df.drop(idx.values, axis=0)

which yields the DataFrame:

      0     1
1   5.5  11.0
2  22.0   3.0
3   4.0   9.8
5  32.0   7.6

You might find Pandas syntax easier to use, but for the above calculation NumPy is faster.
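
To check the speed difference on your own machine, something like the following could be used. It is a rough sketch: the data is synthetic, the function names `with_numpy` and `with_pandas` are made up here, and the timings will vary with array size and library versions.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.random((3000, 2))  # synthetic data: 1000 groups of 3 rows

def with_numpy(x):
    y = x.reshape(-1, 3, 2)
    idx = y[..., 1].argmax(axis=1)
    mask = np.arange(3)[None, :] != idx[:, None]
    return y[mask]

def with_pandas(x):
    df = pd.DataFrame(x)
    idx = df.groupby(df.index // 3)[1].idxmax()
    return df.drop(idx.values, axis=0).to_numpy()

# Both approaches should keep exactly the same rows
same_result = np.array_equal(with_numpy(x), with_pandas(x))

t_numpy = timeit.timeit(lambda: with_numpy(x), number=20)
t_pandas = timeit.timeit(lambda: with_pandas(x), number=20)
print(f"numpy: {t_numpy:.4f}s  pandas: {t_pandas:.4f}s")
```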

1 Comment

Thank you for the effective answer and detailed description, and I think I need time to fully understand them.