I have a 1D array of strings (gene_name_list). I need to find rows in another 2D array (fully_split) where each string of the first array is present. Of course, I can solve it by brute force like this:

import numpy as np

longest_gene_name = len(max(gene_name_list, key=len))
ensembl_list = np.full((len(gene_name_list)), '', dtype='U{}'.format(longest_gene_name))
for idx, gene_name in enumerate(gene_name_list):
    for row in fully_split:
        if gene_name in row:
            ensembl_list[idx] = row[0]

But it takes too long; I need a faster solution.

To clarify: row[0] contains the special symbols that I am mapping to. If a string is found, it will be found in the row[1:] portion, and I then take row[0].

2 Answers

Based on your description, I am making a couple of assumptions:
- The 2D array is rectangular (i.e. not dtype=object), since NumPy offers little performance benefit otherwise.
- len(fully_split) == len(gene_name_list), since your code example has ensembl_list[idx] = row[0] and idx is derived from gene_name_list.

>>> import numpy as np

>>> gene_name_list = np.array('a bb c d eee'.split())

>>> fully_split = np.array([
...     'id1 a bb c d eee'.split(), # yes
...     'id2 f g hh iii j'.split(),
...     'id3 kk ll a nn o'.split(), # yes
...     'id4 q rr c t eee'.split(), # yes
...     'id5 v www xx y z'.split()
... ])

>>> longest_gene_name = len(max(gene_name_list, key=len))

>>> dtype = 'U{}'.format(longest_gene_name)

>>> ensembl_list = np.zeros_like(gene_name_list, dtype=dtype)

>>> mask = np.isin(fully_split, gene_name_list).any(axis=1)

>>> ensembl_list[mask] = fully_split[mask, 0]

>>> ensembl_list
array(['id1', '', 'id3', 'id4', ''], dtype='<U3')
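
If the two lengths differ, or if what you want is one identifier per gene name (as in the loop from the question), a plain dictionary lookup may already be fast enough. The following is only a sketch and not part of the NumPy approach above: build_gene_to_id and map_genes are hypothetical helper names, the sample data is the toy data from this answer, and it assumes names only need to be matched against row[1:]. Like the original loop without a break, the last matching row wins.

```python
def build_gene_to_id(fully_split):
    """Map every name found in row[1:] to the identifier stored in row[0]."""
    gene_to_id = {}
    for row in fully_split:
        identifier = row[0]
        for name in row[1:]:
            # Later rows overwrite earlier ones, mirroring the loop without a break.
            gene_to_id[name] = identifier
    return gene_to_id


def map_genes(gene_name_list, fully_split, missing=''):
    """Return one identifier per gene name, or `missing` if the name is absent."""
    gene_to_id = build_gene_to_id(fully_split)
    return [gene_to_id.get(name, missing) for name in gene_name_list]


# Toy data (assumed for illustration only).
sample_genes = 'a bb c d eee'.split()
sample_rows = [
    'id1 a bb c d eee'.split(),
    'id2 f g hh iii j'.split(),
    'id3 kk ll a nn o'.split(),
    'id4 q rr c t eee'.split(),
    'id5 v www xx y z'.split(),
]

print(map_genes(sample_genes, sample_rows))
# ['id3', 'id1', 'id4', 'id1', 'id4']
```

The table is built in a single pass over fully_split, after which each gene name costs one dictionary lookup instead of a scan over every row. Returning a list also avoids sizing a fixed-width string dtype from the gene names, which can truncate identifiers longer than the longest name.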

2 Comments

Yeah, this is the answer. The only problem is that np.isin is not available in older NumPy versions (it was added in 1.13), and I am using Python 2.7. It makes sense to rewrite it with np.in1d somehow...
Looking at the source code for np.isin, it is pretty much a one-liner to implement (assuming the invert parameter of np.in1d, introduced in v1.8.0, is available with Python 2.7).
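
A rough sketch of that one-liner, for what it is worth. isin_compat is a made-up name, and the fallback branch assumes np.in1d exists in the installed NumPy, which is the case for the old releases this comment is about. np.in1d flattens its first argument, so the result is reshaped back to the input shape.

```python
import numpy as np


def isin_compat(element, test_elements):
    """Element-wise membership test that keeps the shape of `element`."""
    element = np.asarray(element)
    if hasattr(np, 'isin'):
        # Newer NumPy: use the built-in function directly.
        return np.isin(element, test_elements)
    # Older NumPy (assumed to provide np.in1d): test the flattened
    # array, then restore the original shape.
    return np.in1d(element, test_elements).reshape(element.shape)


table = np.array([['id1', 'a', 'bb'],
                  ['id2', 'f', 'g']])
mask = isin_compat(table, ['a', 'g'])
print(mask.any(axis=1))
```

The mask can then be used exactly like the np.isin result in the answer, e.g. isin_compat(fully_split, gene_name_list).any(axis=1).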
Execution time aside, I don't think the brute-force method you posted corresponds to what you describe in words:

I need to find rows in another 2D array where each string of the first array is present.

At best, your code finds all the rows where at least one of the strings of the 1D array is present in the row of the 2D array.

The following code does what you asked in words, using a regex.

import re

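pattern = '.*'.join(map(re.escape, np.sort(gene_name_list)))
rows = [' '.join(np.sort(x)) for x in fully_split]
res = [re.search(pattern, r) for r in rows]

Since order is not relevant, gene_name_list is sorted lexicographically and its escaped strings are joined with the regex '.*', which matches any run of characters between consecutive names. (A bare '*' would not work as a delimiter: it is a quantifier that applies to the preceding character.) This is the pattern that will be searched for.
Each row of the 2D array fully_split is then sorted lexicographically as well, and its strings are joined with spaces into a single string. A regex search is performed on each row to check whether there is a match. Note that this is a substring search, so a name such as 'a' also matches inside a longer entry such as 'aa'.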

res is a list: it holds None for the rows where no match is found, and the corresponding match object where a match is found.

This illustrates the concept. To be closer to your expected result (where you store the first element of the row) replace the last line with:

res = [l[0] if re.search(pattern, r) else None for r, l in zip(rows, fully_split)]
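
The same "every string must be present" check can also be sketched without a regex by using a set, which compares whole entries rather than substrings. This is an alternative technique, not the regex method above: rows_containing_all is a hypothetical helper name, the toy data is assumed for illustration, and the check looks only at row[1:], following the clarification in the question.

```python
def rows_containing_all(gene_name_list, fully_split):
    """Return row[0] for rows whose row[1:] contains every gene name, else None."""
    wanted = set(gene_name_list)
    return [row[0] if wanted.issubset(row[1:]) else None for row in fully_split]


# Toy data (assumed for illustration only).
example_genes = ['a', 'bb', 'c']
example_rows = [
    'id1 a bb c d eee'.split(),   # contains a, bb and c
    'id2 f g hh iii j'.split(),
    'id3 kk ll a nn o'.split(),   # contains only a
    'id4 c bb a t eee'.split(),   # contains all three, in a different order
]

print(rows_containing_all(example_genes, example_rows))
# ['id1', None, None, 'id4']
```

Because set membership ignores order, no sorting is needed, and an entry such as 'aa' does not count as a match for 'a'.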
