Index of identical rows in a NumPy array

Question

I already asked a variation of this question, but I still have a problem with the runtime of my code.

I have a NumPy array with 15000 rows and 44 columns. My goal is to find out which rows are identical and collect their indices in lists, like this:

Result:

equal_rows1 = [1,2,3]
equal_rows2 = [0,4]

So far I have been using the following code:

import numpy as np

input_data = np.load('IN.npy')

equal_inputs1 = []
equal_inputs2 = []

for i in range(len(input_data)):
    for j in range(i + 1, len(input_data)):
        if np.array_equal(input_data[i], input_data[j]):
            equal_inputs1.append(i)
            equal_inputs2.append(j)

The problem is that it takes a long time to return the desired lists, and that it only supports two "similar row lists", although there can be more. Is there a better solution for this, especially regarding the runtime?

cs95 · Accepted Answer · 2019-01-10 15:54:17Z

This is pretty simple with pandas groupby:

df
   A  B  C  D  E
0  1  0  0  0  0
1  0  0  0  0  0
2  0  0  0  0  0
3  0  0  0  0  0
4  1  0  0  0  0
5  1  2  3  4  5

[g.index.tolist() for _, g in df.groupby(df.columns.tolist()) if len(g.index) > 1]
# [[1, 2, 3], [0, 4]]
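
Since the question starts from a NumPy array rather than a DataFrame, here is a minimal sketch of the full round trip. The array `data` and the column labels are assumptions made up to match the example frame above; with the real data you would pass `input_data` instead.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the array loaded from 'IN.npy'.
data = np.array([
    [1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0],
    [1, 2, 3, 4, 5],
])

# Wrap the array in a DataFrame; the labels A..E are only illustrative.
df = pd.DataFrame(data, columns=list("ABCDE"))

# Group on every column, so each group holds rows with identical values.
# Groups of size 1 are unique rows and are skipped.
groups = [
    g.index.tolist()
    for _, g in df.groupby(df.columns.tolist())
    if len(g.index) > 1
]
print(groups)
```

This yields any number of index lists, one per set of identical rows, so it is not limited to two lists.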

If you are dealing with many rows and many unique groups, this might get a bit slow. The performance depends on your data. Perhaps there is a faster NumPy alternative, but this is certainly the easiest to understand.
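
As for a NumPy-only alternative, one possible sketch uses `np.unique` with `axis=0` and `return_inverse=True`. I have not benchmarked it against the groupby version, so treat any speed benefit as an assumption to check on your own data. The function name `duplicate_row_groups` and the sample array are made up for illustration.

```python
import numpy as np


def duplicate_row_groups(arr):
    """Return lists of row indices whose rows are identical."""
    # inverse[i] is the id of the unique row that arr[i] is equal to.
    _, inverse = np.unique(arr, axis=0, return_inverse=True)
    # Flatten in case the NumPy version returns an extra dimension.
    inverse = inverse.ravel()
    # A stable sort keeps the original row order inside each group.
    order = np.argsort(inverse, kind="stable")
    # Group sizes tell us where to cut the sorted index array.
    counts = np.bincount(inverse)
    groups = np.split(order, np.cumsum(counts)[:-1])
    # Keep only groups with more than one member.
    return [g.tolist() for g in groups if len(g) > 1]


data = np.array([
    [1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0],
    [1, 2, 3, 4, 5],
])
groups = duplicate_row_groups(data)
print(groups)
```

Groups come out in the sorted order of the unique rows, which is the same ordering groupby uses by default.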

1 Comment

Dorian IL Over a year ago

A bit late, but this works perfectly. I thank you very much!

jpp · Accepted Answer · 2019-01-10 15:57:16Z

You can use collections.defaultdict, which retains the row values as keys:

from collections import defaultdict

dd = defaultdict(list)

for idx, row in enumerate(df.values):
    dd[tuple(row)].append(idx)

print(list(dd.values()))
# [[0, 4], [1, 2, 3], [5]]

print(dd)
# defaultdict(<class 'list'>, {(1, 0, 0, 0, 0): [0, 4],
#                              (0, 0, 0, 0, 0): [1, 2, 3],
#                              (1, 2, 3, 4, 5): [5]})

You can, if you wish, filter out unique rows via a dictionary comprehension.
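
A minimal sketch of that filtering step, run directly on a NumPy array instead of `df.values`. The names `data` and `duplicates` are illustrative, not part of the original code.

```python
from collections import defaultdict

import numpy as np

# Hypothetical sample data with the same rows as the example above.
data = np.array([
    [1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0],
    [1, 2, 3, 4, 5],
])

# Map each row (as a hashable tuple of plain ints) to the indices where it occurs.
dd = defaultdict(list)
for idx, row in enumerate(data):
    dd[tuple(row.tolist())].append(idx)

# Keep only rows that occur more than once.
duplicates = {key: idxs for key, idxs in dd.items() if len(idxs) > 1}
print(list(duplicates.values()))
```

This makes a single pass over the rows, so it avoids the pairwise comparison of the nested loop in the question.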
