Match array of strings to 2D array

Question

I have a 1D array of strings (gene_name_list). I need to find the rows of another 2D array (fully_split) in which each string of the first array is present. Of course, I can solve it by brute force like this:

import numpy as np

longest_gene_name = len(max(gene_name_list, key=len))
ensembl_list = np.full((len(gene_name_list)), '', dtype='U{}'.format(longest_gene_name))
for idx, gene_name in enumerate(gene_name_list):
    for row in fully_split:
        if gene_name in row:
            # store the ID from the first column of the matching row
            ensembl_list[idx] = row[0]

But it takes too long; I need a faster solution.

To clarify, although it is not really relevant: row[0] contains the special symbols that I am mapping to. So if a string is found, it will be found in the row[1:] portion, and I then take row[0].

Matt Eding · Accepted Answer · 2019-11-01 00:15:44Z


Based on your description I am making a couple assumptions:
- The 2d array is rectangular (i.e. not dtype=object) since NumPy performance would be useless otherwise.
- len(fully_split) == len(gene_name_list) since your code example has ensembl_list[idx] = row[0] and idx is derived from gene_name_list

>>> import numpy as np

>>> gene_name_list = np.array('a bb c d eee'.split())

>>> fully_split = np.array([
...     'id1 a bb c d eee'.split(), # yes
...     'id2 f g hh iii j'.split(),
...     'id3 kk ll a nn o'.split(), # yes
...     'id4 q rr c t eee'.split(), # yes
...     'id5 v www xx y z'.split()
... ])

>>> longest_gene_name = len(max(gene_name_list, key=len))

>>> dtype = 'U{}'.format(longest_gene_name)

>>> ensembl_list = np.zeros_like(gene_name_list, dtype=dtype)

>>> mask = np.isin(fully_split, gene_name_list).any(axis=1)

>>> ensembl_list[mask] = fully_split[mask, 0]

>>> ensembl_list
array(['id1', '', 'id3', 'id4', ''], dtype='<U3')
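If the two arrays do not have the same length and what is wanted is one ID per gene name (which is what the loop in the question computes), a dictionary lookup is one possible alternative. This is only a sketch under that assumption, not part of the approach above; name_to_id and ensembl_per_gene are names made up for the illustration.

```python
import numpy as np

gene_name_list = np.array('a bb c d eee'.split())
fully_split = np.array([
    'id1 a bb c d eee'.split(),
    'id2 f g hh iii j'.split(),
    'id3 kk ll a nn o'.split(),
    'id4 q rr c t eee'.split(),
    'id5 v www xx y z'.split(),
])

# Map every name in row[1:] to that row's ID. Later rows overwrite
# earlier ones, mirroring the "last match wins" loop in the question.
name_to_id = {}
for row in fully_split:
    for name in row[1:]:
        name_to_id[name] = row[0]

# One dictionary lookup per gene name; '' marks names found in no row.
ensembl_per_gene = np.array([name_to_id.get(g, '') for g in gene_name_list])
print(ensembl_per_gene)
```

Building the dictionary touches every cell of fully_split once, after which each gene name costs a single hash lookup instead of a scan over all rows.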


2 Comments

Nikita Vlasenko Over a year ago

Yeah, this is the answer. The only problem is that np.isin is not present in earlier numpy versions and I am using python 2.7. Makes sense to rewrite it then with in1d somehow...

Matt Eding Over a year ago

Looking at the source code for np.isin, it is pretty much a one-liner to implement on top of np.in1d (assuming the invert parameter introduced in v1.8.0 is available with Python 2.7).
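A minimal sketch of such a fallback, assuming only that np.in1d exists on the old installation. The name isin_compat is made up for this example. Newer NumPy releases deprecate np.in1d in favour of np.isin, so the helper prefers np.isin whenever it is present.

```python
import numpy as np

def isin_compat(element, test_elements):
    """Elementwise membership test that keeps the shape of `element`."""
    element = np.asarray(element)
    if hasattr(np, 'isin'):
        # Modern NumPy: use the built-in directly.
        return np.isin(element, test_elements)
    # Older NumPy: in1d flattens its input, so restore the original shape.
    return np.in1d(element, test_elements).reshape(element.shape)

gene_name_list = np.array('a bb c d eee'.split())
fully_split = np.array([
    'id1 a bb c d eee'.split(),
    'id2 f g hh iii j'.split(),
    'id3 kk ll a nn o'.split(),
    'id4 q rr c t eee'.split(),
    'id5 v www xx y z'.split(),
])

# Same row mask as in the answer, built with the compatibility helper.
mask = isin_compat(fully_split, gene_name_list).any(axis=1)
```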

Valentino · Accepted Answer · 2019-10-31 21:36:36Z

Execution time apart, I don't think the brute force method you posted corresponds to what you describe in words:

I need to find rows in another 2D array where each string of the first array is present.

Your code at best finds all the rows where at least one of the strings of the 1D array is present in the row of the 2D array.

The following code does what you asked in words using regex.

import re
import numpy as np

# '.*' between the escaped, sorted names allows any characters between them
pattern = r'.*'.join(map(re.escape, np.sort(gene_name_list)))
rows = [''.join(np.sort(x)) for x in fully_split]
res = [re.search(pattern, r) for r in rows]

Since order is not relevant, gene_name_list is sorted lexicographically and its strings are joined using the regex '.*' (any run of characters) as the delimiter. This is the pattern that will be searched for.
Then each row of the 2D array fully_split is likewise sorted lexicographically and its strings are joined to form a single string. A regex search is performed on each row to check whether there is a match.

res is a list: you get None for those rows where no match is found, and the corresponding match object if a match is found.

This illustrates the concept. To be closer to your expected result (where you store the first element of the row) replace the last line with:

res = [l[0] if re.search(pattern, r) else None for r, l in zip(rows, fully_split)]
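Because the joined row string has no separators, a regex search can in principle match across name boundaries. As a hedged alternative sketch (not the regex method itself), the same "every string is present" test can be written with a set; the sample data and the name wanted below are made up for illustration.

```python
import numpy as np

gene_name_list = np.array('a bb c'.split())
fully_split = np.array([
    'id1 a bb c d eee'.split(),   # contains a, bb and c
    'id2 f g hh iii j'.split(),   # contains none of them
    'id3 kk ll a nn o'.split(),   # contains only a
])

wanted = set(gene_name_list)

# Keep row[0] only when every wanted name occurs somewhere in row[1:].
res = [str(row[0]) if wanted.issubset(row[1:]) else None
       for row in fully_split]
print(res)
```

Here only the first row contains all three names, so it is the only one that yields an ID.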
