Checking column-wise if elements in an array exist in another array

Question

I have two arrays that look like this:

import numpy as np

x1 = np.array([['a','b','c'],['d','a','b'],['c','a,c','c']])
x2 = np.array(['d','c','d'])

I want to check if each element in x2 exists in a corresponding column in x1. So I tried:

print((x1==x2).any(axis=0))
#array([ True, False, False])

Note that x2[1] in x1[2,1] evaluates to True. The problem is that the element we're looking for is sometimes contained inside an element of x1, where it can be identified by splitting on commas. So my desired output is:

array([ True,  True, False])

Is there a way to do it using a numpy (or pandas) native method?

Would a substring check work instead of ==? See "Finding entries containing a substring in a numpy array?", e.g. (np.core.defchararray.find(x1, x2) != -1).any(axis=0). Or do the comma-separated values need to be split into separate elements that are tested separately?
– Henry Ecker ♦, Commented Sep 19, 2021 at 17:41
What do you expect to happen with this string: 'a,c'? Is that a typo, or do you really want to treat it as two different characters? I would say neither 'a' nor 'c' exists in that column, and you should try to clean your data up first. Also, why is your desired result for the third column False? It contains 'c', which is in x2.
– Mark, Commented Sep 19, 2021 at 17:42
@Mark, no, it's not a typo; I want to consider both 'a' and 'c' as two separate characters.
– user7864386, Commented Sep 19, 2021 at 17:44
It makes little sense to me to have a string such as 'a,c' in an array to represent the two separate characters 'a' and 'c'. I would suggest storing them as separate items in the array. If you run into array shape issues, you could pad the smaller arrays with NaNs.
– Andre, Commented Sep 19, 2021 at 18:23
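
The substring idea from the first comment can be sketched as follows. This is an illustration, not part of the original thread: it uses np.char.find (the public alias of the np.core.defchararray function named in the comment), and the arrays y1 and y2 are hypothetical data added only to show where a substring test and a split-based test disagree.

```python
import numpy as np

x1 = np.array([['a', 'b', 'c'], ['d', 'a', 'b'], ['c', 'a,c', 'c']])
x2 = np.array(['d', 'c', 'd'])

# np.char.find returns the index of the first match, or -1 when there is none.
# x2 broadcasts across the rows of x1, so column j is searched for x2[j].
substring_hits = (np.char.find(x1, x2) != -1).any(axis=0)
print(substring_hits)  # [ True  True False]

# Hypothetical data with a multi-character token: 'a' is a substring of
# 'ab,c' but is not one of its comma-separated tokens ('ab' and 'c').
y1 = np.array([['ab,c']])
y2 = np.array(['a'])
false_positive = (np.char.find(y1, y2) != -1).any(axis=0)
token_match = any('a' in cell.split(',') for cell in y1[:, 0])
print(false_positive, token_match)
```

For the single-character data in the question the substring test happens to give the desired result, but it stops being equivalent to token membership as soon as the tokens can be longer than one character.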

tdy · Accepted Answer · 2021-09-20 23:30:43Z

You can vectorize a function to broadcast x2 in x1.split(','):

@np.vectorize
def f(a, b):
    return b in a.split(',')

f(x1, x2).any(axis=0)
# array([ True,  True, False])

Note that "vectorize" is a misnomer. This isn't true vectorization, just a convenient way to broadcast a custom function.
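
To make that concrete, here is a self-contained sketch that runs the vectorized function next to a plain Python loop performing the same per-element work. The loop is only an illustration of what the broadcast amounts to, not how np.vectorize is implemented internally, and the names vectorized and looped are introduced here for the comparison.

```python
import numpy as np

x1 = np.array([['a', 'b', 'c'], ['d', 'a', 'b'], ['c', 'a,c', 'c']])
x2 = np.array(['d', 'c', 'd'])

@np.vectorize
def f(a, b):
    # Called once per broadcast pair (x1[i, j], x2[j]).
    return b in a.split(',')

vectorized = f(x1, x2).any(axis=0)

# The same check written as an explicit loop over columns and cells.
looped = np.array([
    any(x2[j] in cell.split(',') for cell in x1[:, j])
    for j in range(x1.shape[1])
])

print(vectorized)  # [ True  True False]
print(looped)      # [ True  True False]
```

Both versions call Python-level string methods for every cell, so neither gets the speed of a compiled NumPy operation.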

Since you mentioned pandas in parentheses, another option is to apply a splitting/membership function to the columns of df = pd.DataFrame(x1).

However, the numpy function is significantly faster than the three pandas variants defined below:

f(x1, x2).any(axis=0)         # 24.2 µs ± 2.8 µs
df.apply(list_comp).any()     # 913 µs ± 12.1 µs
df.apply(combine_in).any()    # 1.8 ms ± 104 µs
df.apply(expand_eq_any).any() # 3.28 ms ± 751 µs

# use a list comprehension to do the splitting and membership checking:
def list_comp(col):
    return [x2[col.name] in val.split(',') for val in col]

# split the whole column and use `combine` to check `x2 in x1`
def combine_in(col):
    return col.str.split(',').combine(x2[col.name], lambda a, b: b in a)

# split the column into expanded columns and check the expanded rows for matches
def expand_eq_any(col):
    return col.str.split(',', expand=True).eq(x2[col.name]).any(axis=1)
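
To reproduce the comparison on your own machine, something like the following sketch could be used. It assumes pandas is installed, times only the numpy function and the list-comprehension variant, and uses timeit with an arbitrary repeat count of 200; the absolute numbers will differ from the ones quoted above depending on hardware and library versions.

```python
import timeit

import numpy as np
import pandas as pd

x1 = np.array([['a', 'b', 'c'], ['d', 'a', 'b'], ['c', 'a,c', 'c']])
x2 = np.array(['d', 'c', 'd'])
df = pd.DataFrame(x1)  # column labels 0, 1, 2 line up with positions in x2

@np.vectorize
def f(a, b):
    return b in a.split(',')

def list_comp(col):
    # col.name is the integer column label, used to pick the matching x2 entry
    return [x2[col.name] in val.split(',') for val in col]

# Check that both approaches agree before timing them.
numpy_result = f(x1, x2).any(axis=0)
pandas_result = df.apply(list_comp).any().to_numpy()

repeats = 200  # arbitrary; raise it for more stable measurements
numpy_time = timeit.timeit(lambda: f(x1, x2).any(axis=0), number=repeats)
pandas_time = timeit.timeit(lambda: df.apply(list_comp).any(), number=repeats)

print(f"numpy:  {numpy_time / repeats * 1e6:.1f} µs per call")
print(f"pandas: {pandas_time / repeats * 1e6:.1f} µs per call")
```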
