Let's suppose we have a binary matrix A with shape n x m. I want to identify rows that have duplicates in the matrix, i.e. rows for which there is another index along the same dimension with the same elements in the same positions.
It's very important not to convert this matrix to a dense representation, since the real matrices I'm using are quite large and difficult to handle in terms of memory.
Using PyTorch for the implementation:
import torch

# This is just a toy sparse binary matrix with n = 10 and m = 100
A = torch.randint(0, 2, (10, 100), dtype=torch.float32).to_sparse()
Intuitively, we can take the dot product of this matrix with itself, producing a new m x m matrix whose entry (i, j) is the number of 1s that index i has in the same positions as index j along dimension 0.
B = (A.T @ A).coalesce() # In PyTorch, this operation will also produce a sparse representation; coalesce() is needed before reading indices()/values()
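As a quick sanity check (at toy sizes only, since calling .to_dense() defeats the purpose for the real matrices), the entries of B can be compared against the dense computation:

```python
import torch

torch.manual_seed(0)
# Toy sparse binary matrix, as in the question (n = 10, m = 100)
A = torch.randint(0, 2, (10, 100), dtype=torch.float32).to_sparse()

B = (A.t() @ A).coalesce()  # sparse m x m co-occurrence counts

# Compare against the dense computation (fine at toy sizes only)
Ad = A.to_dense()
assert torch.equal(B.to_dense(), Ad.T @ Ad)

# B[i, j] is the number of positions where columns i and j are both 1
i, j = 3, 7
assert B.to_dense()[i, j] == (Ad[:, i] * Ad[:, j]).sum()
```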
At this point, I've tried to combine these values by comparing them with the per-column counts:
num_elements = torch.sparse.sum(A, dim=0).to_dense()  # A.sum(0) stays sparse and cannot be fancy-indexed
duplicate_rows = torch.logical_and(
    num_elements[B.indices()[0]] == num_elements[B.indices()[1]],
    num_elements[B.indices()[0]] == B.values()
)
But this did not work!
I think the solution can be written using only operations on PyTorch sparse tensors (without Python loops and so on), which could also be a benefit in terms of performance.
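For what it's worth, here is one possible sketch of that idea, wrapped in a hypothetical helper `find_duplicate_columns` (the name is mine, not from any library): columns i and j are identical exactly when their overlap B[i, j] equals both of their 1-counts; all-zero columns never appear in B and need a separate check. This assumes sparse-sparse matmul and torch.sparse.sum are available in your PyTorch version.

```python
import torch

def find_duplicate_columns(A: torch.Tensor) -> torch.Tensor:
    """Boolean mask over the m columns of a sparse binary n x m matrix:
    True where a column is identical to at least one other column.
    Uses only sparse ops; nothing of size n x m is densified."""
    B = (A.t() @ A).coalesce()                      # sparse m x m overlap counts
    counts = torch.sparse.sum(A, dim=0).to_dense()  # per-column number of 1s (length-m vector)
    i, j = B.indices()
    vals = B.values()
    # columns i and j are equal iff their overlap equals both of their 1-counts
    same = (counts[i] == vals) & (counts[j] == vals) & (i != j)
    dup = torch.zeros(A.shape[1], dtype=torch.bool)
    dup[i[same]] = True
    # all-zero columns produce no entries in B, but two of them are duplicates
    zero = counts == 0
    if zero.sum() > 1:
        dup |= zero
    return dup

# Toy check: columns 0 and 2 are identical; columns 3 and 4 are both all-zero
A = torch.tensor([[1., 0., 1., 0., 0.],
                  [1., 1., 1., 0., 0.],
                  [0., 1., 0., 0., 0.]]).to_sparse()
print(find_duplicate_columns(A).tolist())  # [True, False, True, True, True]
```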
Comment: B = A.T @ A would compare columns, while A @ A.T would compare rows. Are you just trying to indicate duplicate rows in general, or do you want to assign duplicate rows to groups? Have you looked into writing loops at the Python level and using TorchScript to compile?
Comment: sum is performed by @ with an appropriate array; the result is a dense array. unique on that array sorts it and checks for adjacent duplicates. That kind of sort could reduce the search space for duplicates. With CSR format you could also deduce the row length from the indptr steps. Since the matrix is binary, you are only interested in duplicate column indices.
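Building on the unique idea from the comments, here is a hedged sketch of a hashing variant: project each row onto a random vector with a sparse-dense matmul (the dense output is only length n), then group equal hashes with torch.unique. Identical rows always hash equal; distinct rows collide only with negligible probability in float64 (a second projection could be used to confirm if needed). `duplicate_row_mask` is a hypothetical helper name of my own.

```python
import torch

def duplicate_row_mask(A: torch.Tensor, seed: int = 0) -> torch.Tensor:
    """Flag rows of a sparse n x m matrix that occur more than once,
    via random-projection hashing (no n x m densification)."""
    g = torch.Generator().manual_seed(seed)
    r = torch.rand(A.shape[1], generator=g, dtype=torch.float64)
    # Sparse x dense matmul: one float64 hash per row, dense length-n output
    hashes = torch.sparse.mm(A.to(torch.float64), r.unsqueeze(1)).squeeze(1)
    # unique sorts the hashes and groups equal ones
    _, inverse, counts = torch.unique(hashes, return_inverse=True, return_counts=True)
    return counts[inverse] > 1  # True where the row's hash occurs more than once

# Toy check: rows 0 and 2 are identical
A = torch.tensor([[1., 0., 1.],
                  [0., 1., 1.],
                  [1., 0., 1.]]).to_sparse()
print(duplicate_row_mask(A).tolist())  # [True, False, True]
```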