I have two arrays: one contains all data points and the other contains a sample of data points. I would like to get boolean arrays (later to be used as indices) that reveal whether each sample point is contained in the original array of all data points. The approach should work regardless of the array dimensions used. I have done this successfully, but would like a simpler (i.e., vectorized, without a for-loop) approach. A brief example is below:

import numpy as np

## ALL DATA POINTS: (xi, yi, zi)
p1 = np.array([1, 2, 3])
p2 = np.array([4, 5, 6])
p3 = np.array([2, 3, 4])
p4 = np.array([7, 8, 5])
points = np.array([p1, p2, p3, p4])

## SAMPLE OF DATA (xi, yi, zi)
s1 = np.array([1, 2, 3])
s2 = np.array([4, 6, 5])
s3 = np.array([7, 8, 9])
samples = np.array([s1, s2, s3])

So the data looks like:

print("\nDATA POINTS ({}):\n{}\n".format(points.shape, points))
print("\nSAMPLE POINTS ({}):\n{}\n".format(samples.shape, samples))

DATA POINTS ((4, 3)):
[[1 2 3]
 [4 5 6]
 [2 3 4]
 [7 8 5]]


SAMPLE POINTS ((3, 3)):
[[1 2 3]
 [4 6 5]
 [7 8 9]]

So the point (1, 2, 3) is both the first data point and the first sample point; the other two sample points do not appear among the data points. The function below loops over the samples (in a list comprehension) and compares each one element-wise against every data point.

f2 = lambda points, samples : np.array([sample == points for sample in samples])
ans2 = f2(points, samples)

The resulting boolean array looks like this:

for sample, arr in zip(samples, ans2):
    print("\n-- SAMPLE POINT: {}\n".format(sample))
    print("\n .. CONTAINMENT ARRAY ({}):\n{}\n".format(arr.shape, arr))
    res = np.all(arr, axis=1)
    print("\n .. POINTS CONTAINED ({}):\n{}\n".format(res.shape, res))


-- SAMPLE POINT: [1 2 3]


 .. CONTAINMENT ARRAY ((4, 3)):
[[ True  True  True]
 [False False False]
 [False False False]
 [False False False]]


 .. POINTS CONTAINED ((4,)):
[ True False False False]


-- SAMPLE POINT: [4 6 5]


 .. CONTAINMENT ARRAY ((4, 3)):
[[False False False]
 [ True False False]
 [False False False]
 [False False  True]]


 .. POINTS CONTAINED ((4,)):
[False False False False]


-- SAMPLE POINT: [7 8 9]


 .. CONTAINMENT ARRAY ((4, 3)):
[[False False False]
 [False False False]
 [False False False]
 [ True  True False]]


 .. POINTS CONTAINED ((4,)):
[False False False False]

This result is correct.

However, I think there must be a simpler method to achieve this result. I have looked at numpy.isin; however, the results are not identical. Below is my attempt:

f1 = lambda points, samples : np.isin(samples, points)
ans1 = f1(points, samples)

This result looks like:

print("\n*- ANS 1 ({}):\n{}\n".format(ans1.shape, ans1))

*- ANS 1 ((3, 3)):
[[ True  True  True]
 [ True  True  True]
 [ True  True False]]

From this result, I can see that np.isin checks each value individually against all values in points, without regard for its placement within a row. The values 4, 6, and 5 each occur somewhere in points, which is why True is returned for every element of the second row.
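A minimal sketch of that behaviour, reusing the arrays from the example above. The variable names `elementwise`, `same_as_flat`, and `rows_by_isin` are only illustrative. It suggests that np.isin treats its second argument as a flat collection of values, so reducing its output over each row still reports s2 as present:

```python
import numpy as np

points = np.array([[1, 2, 3], [4, 5, 6], [2, 3, 4], [7, 8, 5]])
samples = np.array([[1, 2, 3], [4, 6, 5], [7, 8, 9]])

# np.isin tests each scalar in `samples` against all scalars in `points`
elementwise = np.isin(samples, points)

# Flattening `points` first gives the same mask: row structure plays no role
same_as_flat = np.array_equal(elementwise, np.isin(samples, points.ravel()))

# Reducing over each row therefore marks s2 = (4, 6, 5) as found,
# even though no single data point equals (4, 6, 5)
rows_by_isin = elementwise.all(axis=1)

print(same_as_flat)   # True
print(rows_by_isin)   # [ True  True False]
```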

How can I modify this approach or start anew to check if each sub-array of sample points is contained in the array of all data points in a simpler way?

  • What exactly is the final output that you are looking for? Commented May 1, 2019 at 5:48
  • @Divakar Given the arrays points and samples in the first example (at the top), I would like the output to be [True, False, False] - True for sample point s1, and False for sample points s2 and s3. Commented May 1, 2019 at 5:50
  • For the simplest: (samples[:,None]==points).all(2).any(1). For performance, see stackoverflow.com/questions/54791950/… Commented May 1, 2019 at 6:02
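The broadcasting one-liner from the last comment can be sketched as below. The helper name `rows_in` is hypothetical and not part of NumPy, and `axis=-1` is used in place of the literal axes 2 and 1, which is equivalent for 2-D inputs like these. This is one possible reading of the suggestion, not a tuned implementation.

```python
import numpy as np

def rows_in(samples, points):
    """Return a 1-D boolean mask: True where a row of `samples` equals some row of `points`."""
    samples = np.asarray(samples)
    points = np.asarray(points)
    # (n_samples, 1, n_dims) == (n_points, n_dims) -> (n_samples, n_points, n_dims)
    matches = samples[:, None] == points
    # all coordinates equal -> (n_samples, n_points); any point matches -> (n_samples,)
    return matches.all(axis=-1).any(axis=-1)

points = np.array([[1, 2, 3], [4, 5, 6], [2, 3, 4], [7, 8, 5]])
samples = np.array([[1, 2, 3], [4, 6, 5], [7, 8, 9]])

mask = rows_in(samples, points)
print(mask)            # [ True False False]
print(samples[mask])   # [[1 2 3]]
```

The intermediate comparison array holds n_samples × n_points × n_dims booleans, so for large inputs memory use may matter more than the loop it replaces.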
