
I have searched around a bit for tutorials, etc., to help with this problem but can't seem to find anything.

I have two lists of n-dimensional numpy arrays (the 3D array form of some images) and want to check for overlapping images between the two lists. Let's say list a is a training set and list b is a validation set. One solution is just to use a nested loop and check whether each pair of arrays is equal using np.array_equal(a[i], b[j]), but that is slow (each list has about 200,000 numpy arrays in it) and frankly quite disgusting.

I was thinking a more elegant way of achieving this would be to hash each of the numpy arrays within each list and then compare the entries using those hashes.

Firstly, is this solution correct, and secondly, how would I go about achieving this?
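For concreteness, here is the kind of thing I had in mind (a rough sketch using the raw bytes of each array as an exact-match dictionary key; this assumes all images share the same shape and dtype):

import numpy as np

# map each training array's raw bytes to its index (exact bitwise match)
train_lookup = {a.tobytes(): i for i, a in enumerate(train_dataset)}

# any validation array whose bytes appear in the lookup is a duplicate
duplicates = [(train_lookup[b.tobytes()], j)
              for j, b in enumerate(valid_dataset)
              if b.tobytes() in train_lookup]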
An example of some of the data is below.

train_dataset[:3]
array([[[-0.5       , -0.49607843, -0.5       , ..., -0.5       ,
         -0.49215686, -0.5       ],
        [-0.49607843, -0.47647059, -0.5       , ..., -0.5       ,
         -0.47254902, -0.49607843],
        [-0.49607843, -0.49607843, -0.5       , ..., -0.5       ,
         -0.49607843, -0.49607843],
        ..., 
        [-0.49607843, -0.49215686, -0.5       , ..., -0.5       ,
         -0.49215686, -0.49607843],
        [-0.49607843, -0.47647059, -0.5       , ..., -0.5       ,
         -0.47254902, -0.49607843],
        [-0.5       , -0.49607843, -0.5       , ..., -0.5       ,
         -0.49607843, -0.5       ]],

       [[-0.5       , -0.5       , -0.5       , ...,  0.48823529,
          0.5       ,  0.1509804 ],
        [-0.5       , -0.5       , -0.5       , ...,  0.48431373,
          0.14705883, -0.32745099],
        [-0.5       , -0.5       , -0.5       , ..., -0.32745099,
         -0.5       , -0.49607843],
        ..., 
        [-0.5       , -0.44901961,  0.1509804 , ..., -0.5       ,
         -0.5       , -0.5       ],
        [-0.49607843, -0.49607843, -0.49215686, ..., -0.5       ,
         -0.5       , -0.5       ],
        [-0.5       , -0.49607843, -0.48823529, ..., -0.5       ,
         -0.5       , -0.5       ]],

       [[-0.5       , -0.5       , -0.5       , ..., -0.5       ,
         -0.5       , -0.5       ],
        [-0.5       , -0.5       , -0.5       , ..., -0.5       ,
         -0.5       , -0.5       ],
        [-0.5       , -0.5       , -0.49607843, ..., -0.5       ,
         -0.5       , -0.5       ],
        ..., 
        [-0.5       , -0.5       , -0.5       , ..., -0.48823529,
         -0.5       , -0.5       ],
        [-0.5       , -0.5       , -0.5       , ..., -0.5       ,
         -0.5       , -0.5       ],
        [-0.5       , -0.5       , -0.5       , ..., -0.5       ,
         -0.5       , -0.5       ]]], dtype=float32)

I appreciate the help in advance.

  • Show us a bit of what you've done with the pairwise comparison. I don't know what you mean by "disgusting" in this context. Commented Sep 25, 2016 at 0:30
  • See recent stackoverflow.com/questions/39674863/…. Also look at questions about unique or sorted rows. Commented Sep 25, 2016 at 2:42
  • Basically I was trying this: duplicates = []; for i in train_dataset: for j in valid_dataset: duplicates.append(np.equal(i, j)). Sorry about the formatting, these comments are weird. Commented Sep 25, 2016 at 3:40
  • How does that np.equal perform with floats? Usually we recommend np.allclose when testing float arrays against each other. (or np.isclose) Commented Sep 25, 2016 at 3:58
  • obviously not very good because I let it run for about 30 minutes and it hadn't finished haha. I'll give those two options a try now. Commented Sep 25, 2016 at 4:06

3 Answers


The numpy_indexed package (disclaimer: I am its author) has an efficient one-liner for this:

import numpy_indexed as npi
duplicate_images = npi.intersection(train_dataset, test_dataset) 

Plus a lot of related functions which you might find useful in this context.
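For instance, on two small made-up stacks of 2x2 "images" (this assumes both datasets are ndarrays indexed along the first axis):

import numpy as np
import numpy_indexed as npi

a = np.array([[[0., 1.], [2., 3.]],
              [[4., 5.], [6., 7.]]])
b = np.array([[[4., 5.], [6., 7.]],
              [[8., 9.], [0., 1.]]])

print(npi.intersection(a, b))  # the one 2x2 array present in both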


2 Comments

Is there a way to make this work with separate X and y (e.g. X_train, y_train, X_test, y_test)? I thought about returning indices like np.intersect1d allows and then checking for the intersection in a second pass, but this is unfortunately not (yet?) supported by numpy_indexed.
If I understand you correctly, simply passing in tuples, ie npi.intersection((X_train, y_train), (X_test, y_test)) should accomplish your goal. Perhaps not the best documented feature, but look for LexIndex in the source.

It's not so difficult to come up with something:

from collections import defaultdict
import numpy as np

def arrayhash(arr):
    # view the raw bytes as unsigned integers of the same width,
    # then XOR them together into a single hash value
    u = arr.view('u' + str(arr.itemsize))
    return np.bitwise_xor.reduce(u.ravel())

def do_the_thing(a, b, hashfunc=arrayhash):
    # bucket the indices of the arrays in `a` by their hash
    table = defaultdict(list)
    for i, a_i in enumerate(a):
        table[hashfunc(a_i)].append(i)

    # compare each array in `b` only against the arrays in `a`
    # that landed in the same bucket
    indices = []
    for j, b_j in enumerate(b):
        candidates = table[hashfunc(b_j)]
        for i in candidates:
            if np.array_equiv(a[i], b_j):
                indices.append((i, j))

    return indices
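
For example, a quick sanity check with small made-up arrays:

a = [np.full((2, 2), 0.5, dtype='float32'), np.zeros((2, 2), dtype='float32')]
b = [np.zeros((2, 2), dtype='float32'), np.ones((2, 2), dtype='float32')]
print(do_the_thing(a, b))  # [(1, 0)]: a[1] and b[0] are bitwise identical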

But note:

  • Checking floating-point values for equality is often a bad idea, because of limited precision and rounding error. Famous example:

    >>> 0.1 + 0.2 == 0.3
    False
    
  • NaNs don't compare equal to themselves:

    >>> np.nan == np.nan
    False
    
  • The simple hash function above hashes the bit representation of the floats, which is problematic in the presence of negative zero and signaling NaNs; see the example below.
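
    For example, using arrayhash from above, two arrays that compare equal as floats can still hash differently:

    >>> x, y = np.array([0.0]), np.array([-0.0])
    >>> np.array_equiv(x, y)
    True
    >>> arrayhash(x) == arrayhash(y)
    False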

See also the discussion in this question: Good way to hash a float vector?



You can find duplicates between arrays using numpy's intersect1d (one-dimensional set intersection) function.

duplicate_images = np.intersect1d(train_dataset, test_dataset) 

I timed this using the train and test sets (55,000 and 10,000 arrays respectively) from one of the tensorflow tutorials, which I'm guessing are similar to your data. Using intersect1d, it took about 2.4 seconds to complete on my machine (only 1.3 seconds with the parameter assume_unique=True). A pairwise comparison like you describe took several minutes.

EDIT

This answer (above) does not compare each "image" array; as @mbhall88 points out in the comments, it compares elements within the arrays, not the arrays themselves. To make sure it compares whole arrays you can still use intersect1d, but you have to mess around with the dtypes first, as explained here. The example in that answer deals with 2D arrays, though, and since you're working with 3D arrays you should first flatten the last two dimensions. You should be able to do something like:

def intersect3d(A, B, assume_unique=False):
    # get the original shape of your arrays
    n, a2d, a3d = A.shape
    ncols = a2d * a3d
    # flatten the 2nd and 3rd dimensions so each image becomes one row
    A = A.reshape((n, ncols))
    B = B.reshape((len(B), ncols))
    # define a structured dtype so each row can be treated as a single "element"
    dtype = ', '.join([str(A.dtype)] * ncols)
    # find the duplicate rows (i.e. the duplicate images)
    C = np.intersect1d(A.view(dtype), B.view(dtype), assume_unique=assume_unique)
    # view the result as the original dtype and restore the image shape
    return C.view(A.dtype).reshape((len(C), a2d, a3d))
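
A quick check on small made-up data:

A = np.arange(24, dtype='float32').reshape((4, 3, 2))
B = np.concatenate([A[2:3], np.zeros((1, 3, 2), dtype='float32')])
print(intersect3d(A, B))  # the single shared 3x2 image, A[2]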

2 Comments

The issue with this approach is that it finds elements within the arrays that are the same. I am interested in which arrays are the same, i.e. all elements within the array are exactly the same, not just one element.
You are absolutely right. I believe that this answer shows how to use intersect1d in the way that you need. I will update the answer with a link as well.
