
I have searched around a bit for tutorials, etc., to help with this problem but can't seem to find anything.

I have two lists of n-dimensional numpy arrays (the 3D array form of some images) and want to check for overlapping images between the two lists. Let's say list a is a training set and list b is a validation set. One solution is just to use a nested loop and check whether each pair of arrays is equal using np.array_equal(a[i], b[j]), but that is slow (each list has about 200,000 numpy arrays in it) and frankly quite disgusting.

I was thinking a more elegant way of achieving this would be to hash each of the numpy arrays within each list and then compare the entries using those hashes.

Firstly, is this solution correct, and secondly, how would I go about achieving this?
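For concreteness, here is the kind of thing I had in mind (a rough sketch using the raw bytes of each array as an exact-match dictionary key; this assumes all images share the same shape and dtype):

import numpy as np

# map each training array's raw bytes to its index (exact bitwise match)
train_lookup = {a.tobytes(): i for i, a in enumerate(train_dataset)}

# any validation array whose bytes appear in the lookup is a duplicate
duplicates = [(train_lookup[b.tobytes()], j)
              for j, b in enumerate(valid_dataset)
              if b.tobytes() in train_lookup]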
An example of some of the data is below.

train_dataset[:3]
array([[[-0.5       , -0.49607843, -0.5       , ..., -0.5       ,
         -0.49215686, -0.5       ],
        [-0.49607843, -0.47647059, -0.5       , ..., -0.5       ,
         -0.47254902, -0.49607843],
        [-0.49607843, -0.49607843, -0.5       , ..., -0.5       ,
         -0.49607843, -0.49607843],
        ..., 
        [-0.49607843, -0.49215686, -0.5       , ..., -0.5       ,
         -0.49215686, -0.49607843],
        [-0.49607843, -0.47647059, -0.5       , ..., -0.5       ,
         -0.47254902, -0.49607843],
        [-0.5       , -0.49607843, -0.5       , ..., -0.5       ,
         -0.49607843, -0.5       ]],

       [[-0.5       , -0.5       , -0.5       , ...,  0.48823529,
          0.5       ,  0.1509804 ],
        [-0.5       , -0.5       , -0.5       , ...,  0.48431373,
          0.14705883, -0.32745099],
        [-0.5       , -0.5       , -0.5       , ..., -0.32745099,
         -0.5       , -0.49607843],
        ..., 
        [-0.5       , -0.44901961,  0.1509804 , ..., -0.5       ,
         -0.5       , -0.5       ],
        [-0.49607843, -0.49607843, -0.49215686, ..., -0.5       ,
         -0.5       , -0.5       ],
        [-0.5       , -0.49607843, -0.48823529, ..., -0.5       ,
         -0.5       , -0.5       ]],

       [[-0.5       , -0.5       , -0.5       , ..., -0.5       ,
         -0.5       , -0.5       ],
        [-0.5       , -0.5       , -0.5       , ..., -0.5       ,
         -0.5       , -0.5       ],
        [-0.5       , -0.5       , -0.49607843, ..., -0.5       ,
         -0.5       , -0.5       ],
        ..., 
        [-0.5       , -0.5       , -0.5       , ..., -0.48823529,
         -0.5       , -0.5       ],
        [-0.5       , -0.5       , -0.5       , ..., -0.5       ,
         -0.5       , -0.5       ],
        [-0.5       , -0.5       , -0.5       , ..., -0.5       ,
         -0.5       , -0.5       ]]], dtype=float32)

I appreciate the help in advance.

  • Show us a bit of what you've done with the pairwise comparison. I don't know what you mean by "disgusting" in this context. Commented Sep 25, 2016 at 0:30
  • See recent stackoverflow.com/questions/39674863/…. Also look at questions about unique or sorted rows. Commented Sep 25, 2016 at 2:42
  • Basically I was trying this: duplicates = []; for i in train_dataset: for j in valid_dataset: duplicates.append(np.equal(i, j)). Sorry about the formatting, these comments are weird. Commented Sep 25, 2016 at 3:40
  • How does that np.equal perform with floats? Usually we recommend np.allclose when testing float arrays against each other. (or np.isclose) Commented Sep 25, 2016 at 3:58
  • obviously not very good because I let it run for about 30 minutes and it hadn't finished haha. I'll give those two options a try now. Commented Sep 25, 2016 at 4:06

3 Answers


The numpy_indexed package (disclaimer: I am its author) has an efficient one-liner for this:

import numpy_indexed as npi
duplicate_images = npi.intersection(train_dataset, test_dataset) 

Plus a lot of related functions which you might find useful in this context.
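For instance, on two small made-up stacks of 2x2 "images" (this assumes both datasets are ndarrays indexed along the first axis):

import numpy as np
import numpy_indexed as npi

a = np.array([[[0., 1.], [2., 3.]],
              [[4., 5.], [6., 7.]]])
b = np.array([[[4., 5.], [6., 7.]],
              [[8., 9.], [0., 1.]]])

print(npi.intersection(a, b))  # the one 2x2 array present in both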


2 Comments

Is there a way to make this work with separate X and y (e.g. X_train, y_train, X_test, y_test)? I thought about returning indices like np.intersect1d allows and then checking for the intersection in a second pass, but this is unfortunately not (yet?) supported by numpy_indexed.
If I understand you correctly, simply passing in tuples, ie npi.intersection((X_train, y_train), (X_test, y_test)) should accomplish your goal. Perhaps not the best documented feature, but look for LexIndex in the source.

It's not so difficult to come up with something:

from collections import defaultdict
import numpy as np

def arrayhash(arr):
    # view the raw bytes as unsigned integers of the same width,
    # then XOR them together into a single hash value
    u = arr.view('u' + str(arr.itemsize))
    return np.bitwise_xor.reduce(u.ravel())

def do_the_thing(a, b, hashfunc=arrayhash):
    # bucket the indices of the arrays in `a` by their hash
    table = defaultdict(list)
    for i, a_i in enumerate(a):
        table[hashfunc(a_i)].append(i)

    # compare each array in `b` only against the arrays in `a`
    # that landed in the same bucket
    indices = []
    for j, b_j in enumerate(b):
        candidates = table[hashfunc(b_j)]
        for i in candidates:
            if np.array_equiv(a[i], b_j):
                indices.append((i, j))

    return indices
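
For example, a quick sanity check with small made-up arrays:

a = [np.full((2, 2), 0.5, dtype='float32'), np.zeros((2, 2), dtype='float32')]
b = [np.zeros((2, 2), dtype='float32'), np.ones((2, 2), dtype='float32')]
print(do_the_thing(a, b))  # [(1, 0)]: a[1] and b[0] are bitwise identical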

But note:

  • Checking floating-point values for equality is often a bad idea, because of limited precision and rounding error. Famous example:

    >>> 0.1 + 0.2 == 0.3
    False
    
  • NaNs don't compare equal to themselves:

    >>> np.nan == np.nan
    False
    
  • The simple hash function above hashes the bit representation of the floats, which is problematic in the presence of negative zero and signaling NaNs; see the example below.
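
    For example, using arrayhash from above, two arrays that compare equal as floats can still hash differently:

    >>> x, y = np.array([0.0]), np.array([-0.0])
    >>> np.array_equiv(x, y)
    True
    >>> arrayhash(x) == arrayhash(y)
    False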

See also the discussion in this question: Good way to hash a float vector?



You can find duplicates between arrays using numpy's intersect1d (one-dimensional set intersection) function.

duplicate_images = np.intersect1d(train_dataset, test_dataset) 

I timed this using the train and test sets (55,000 and 10,000 arrays respectively) from one of the tensorflow tutorials, which I'm guessing are similar to your data. Using intersect1d, it took about 2.4 seconds to complete on my machine (only 1.3 seconds with the parameter assume_unique=True). A pairwise comparison like you describe took several minutes.

EDIT

This answer (above) does not compare each "image" array; as @mbhall88 points out in the comments, it compares elements within the arrays, not the arrays themselves. To make sure it compares whole arrays you can still use intersect1d, but you have to mess around with the dtypes first, as explained here. The example in that answer deals with 2D arrays, though, and since you're working with 3D arrays you should first flatten the last two dimensions. You should be able to do something like:

def intersect3d(A, B, assume_unique=False):
    # get the original shape of your arrays
    n, a2d, a3d = A.shape
    ncols = a2d * a3d
    # flatten the 2nd and 3rd dimensions so each image becomes one row
    A = A.reshape((n, ncols))
    B = B.reshape((len(B), ncols))
    # define a structured dtype so each row can be treated as a single "element"
    dtype = ', '.join([str(A.dtype)] * ncols)
    # find the duplicate rows (i.e. the duplicate images)
    C = np.intersect1d(A.view(dtype), B.view(dtype), assume_unique=assume_unique)
    # view the result as the original dtype and restore the image shape
    return C.view(A.dtype).reshape((len(C), a2d, a3d))
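
A quick check on small made-up data:

A = np.arange(24, dtype='float32').reshape((4, 3, 2))
B = np.concatenate([A[2:3], np.zeros((1, 3, 2), dtype='float32')])
print(intersect3d(A, B))  # the single shared 3x2 image, A[2]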

2 Comments

The issue with this approach is that it finds elements within the arrays that are the same. I am interested in which arrays are the same, i.e. all elements within the array are exactly the same, not just one element.
You are absolutely right. I believe that this answer shows how to use intersect1d in the way that you need. I will update the answer with a link as well.
