2

I already asked a variation of this question, but I still have a problem regarding the runtime of my code.

Given a numpy array consisting of 15000 rows and 44 columns. My goal is to find out which rows are equal and add them to a list, like this:

1 0 0 0 0
0 0 0 0 0 
0 0 0 0 0 
0 0 0 0 0 
1 0 0 0 0
1 2 3 4 5

Result:

equal_rows1 = [1,2,3]
equal_rows2 = [0,4]

What I did up till now is using the following code:

import numpy as np


input_data = np.load('IN.npy')

equal_inputs1 = []
equal_inputs2 = []

for i in range(len(input_data)):
  for j in range(i+1,len(input_data)):
     if np.array_equal(input_data[i],input_data[j]):
        equal_inputs1.append(i)
        equal_inputs2.append(j)

The problem is that it takes a lot of time to return the desired arrays and that this allows only 2 different "similar row lists" although there can be more. Is there any better solution for this, especially regarding the runtime?

2 Answers 2

1

This is pretty simple with pandas groupby:

df
   A  B  C  D  E
0  1  0  0  0  0
1  0  0  0  0  0
2  0  0  0  0  0
3  0  0  0  0  0
4  1  0  0  0  0
5  1  2  3  4  5

[g.index.tolist() for _, g in df.groupby(df.columns.tolist()) if len(g.index) > 1]
# [[1, 2, 3], [0, 4]]

If you are dealing with many rows and many unique groups, this might get a bit slow. The performance depends on your data. Perhaps there is a faster NumPy alternative, but this is certainly the easiest to understand.

Sign up to request clarification or add additional context in comments.

1 Comment

A bit late, but this works perfectly. I thank you very much!
1

You can use collections.defaultdict, which retains the row values as keys:

from collections import defaultdict

dd = defaultdict(list)

for idx, row in enumerate(df.values):
    dd[tuple(row)].append(idx)

print(list(dd.values()))
# [[0, 4], [1, 2, 3], [5]]

print(dd)
# defaultdict(<class 'list'>, {(1, 0, 0, 0, 0): [0, 4],
#                              (0, 0, 0, 0, 0): [1, 2, 3],
#                              (1, 2, 3, 4, 5): [5]})

You can, if you wish, filter out unique rows via a dictionary comprehension.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.