I have two 2D NumPy arrays. I want to find which rows of np_sentence contain specific row indices of np_weight.

For example:

import numpy as np

# rows are features, columns are clusters or whatever
np_weight = np.random.uniform(1.0, 10.0, size=(7, 4))
print(np_weight)

[[9.96859395 8.65543961 6.07429382 4.58735497]
 [3.21776471 8.33560037 2.11424961 8.89739975]
 [9.74560314 5.94640798 6.10318198 7.33056421]
 [6.60986206 2.36877835 3.06143215 7.82384351]
 [9.49702267 9.98664568 3.89140374 5.42108704]
 [1.93551346 8.45768507 8.60233715 8.09610975]
 [5.21892795 4.18786508 5.82665674 8.28397111]]

# rows are sentence indices, columns are the words in that sentence
np_sentence = np.random.randint(0, 7, size=(5, 3))
print(np_sentence)

[[2 5 1]
 [1 6 4]
 [0 0 0]
 [2 3 6]
 [4 2 4]]

If I sort each column of np_weight in descending order and take the top 5, I get the following (only the first column is shown here):

temp_sorted_result =
[9.96859395] ---> index=0
[9.74560314] ---> index=2
[9.49702267] ---> index=4
[6.60986206] ---> index=3
[5.21892795] ---> index=6
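
The sort-and-take-top-5 step shown above can be sketched as follows. This is only an illustration: it assumes the example values printed earlier are used in place of random ones, and the name top5_idx is hypothetical.

```python
import numpy as np

# Example weights copied from the printed output above (not regenerated randomly).
np_weight = np.array([
    [9.96859395, 8.65543961, 6.07429382, 4.58735497],
    [3.21776471, 8.33560037, 2.11424961, 8.89739975],
    [9.74560314, 5.94640798, 6.10318198, 7.33056421],
    [6.60986206, 2.36877835, 3.06143215, 7.82384351],
    [9.49702267, 9.98664568, 3.89140374, 5.42108704],
    [1.93551346, 8.45768507, 8.60233715, 8.09610975],
    [5.21892795, 4.18786508, 5.82665674, 8.28397111],
])

# argsort sorts each column ascending; reversing the rows gives descending
# order, and the first five rows are then the top-5 row indices per column.
top5_idx = np.argsort(np_weight, axis=0)[::-1][:5]

print(top5_idx[:, 0])  # [0 2 4 3 6]
```

Each column of top5_idx holds the row indices of np_weight that would then be searched for in np_sentence.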

Now I want to search for these indices, two at a time, in the second array np_sentence to see whether any of its rows contains two of them.

For example, for this column the output should be 1, 3, 4. These are the indices of the np_sentence rows that contain a combination of two of the indices in temp_sorted_result.

For instance, 4 and 6, both of which appear in temp_sorted_result, occur together in row 1 of np_sentence, and so on.

I need to do this for each column of np_weight. Efficiency is very important to me because the number of rows is very large.

What I have done so far only searches for a single item in the second array, which is not what I ultimately want.

One approach would be to form all pairwise combinations for each column. For example, for the first column shown above in temp_sorted_result, I would form

(0,2) (0,4) (0,3) (0,6)
(2,4) (2,3) (2,6)
(4,3) (4,6)
(3,6)

and then check which of them appear in the rows of np_sentence. For my np_sentence, the rows with indices 1, 3 and 4 contain some of these pairs.
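
The pairwise check described above could be written as a plain, unoptimized reference like the one below. It is a sketch only, useful mainly for validating a faster solution on small inputs; the function name rows_with_pair is hypothetical, and the arrays are the example values printed earlier.

```python
from itertools import combinations

import numpy as np

np_weight = np.array([
    [9.96859395, 8.65543961, 6.07429382, 4.58735497],
    [3.21776471, 8.33560037, 2.11424961, 8.89739975],
    [9.74560314, 5.94640798, 6.10318198, 7.33056421],
    [6.60986206, 2.36877835, 3.06143215, 7.82384351],
    [9.49702267, 9.98664568, 3.89140374, 5.42108704],
    [1.93551346, 8.45768507, 8.60233715, 8.09610975],
    [5.21892795, 4.18786508, 5.82665674, 8.28397111],
])
np_sentence = np.array([[2, 5, 1], [1, 6, 4], [0, 0, 0], [2, 3, 6], [4, 2, 4]])


def rows_with_pair(weight, sentence, n_top=5):
    """For each weight column, list the sentence rows containing any pair of top-n indices."""
    result = []
    for col in weight.T:
        # row indices of the n_top largest values in this column
        top = np.argsort(col)[::-1][:n_top].tolist()
        # every unordered pair of distinct top indices
        pairs = [set(p) for p in combinations(top, 2)]
        # a row is a hit if it contains both members of at least one pair
        hits = [i for i, row in enumerate(sentence.tolist())
                if any(p <= set(row) for p in pairs)]
        result.append(hits)
    return result


pair_hits = rows_with_pair(np_weight, np_sentence)
print(pair_hits)  # [[1, 3, 4], [0, 1, 4], [0, 1, 3, 4], [0, 1, 3]]
```

This loops in Python over every column, row, and pair, so it is not expected to scale to very large inputs.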

Now my question is: how can I implement this in the most efficient way?

Please let me know if anything is unclear.

I appreciate your help :)

1 Answer

Here is one approach: the function f below creates a mask with the same shape as np_weight (plus one dummy row of False values), marking the top five entries in each column with True.
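
To make the mask idea concrete, here is a small standalone sketch of that step. It uses np.partition to obtain the per-column threshold, which is a substitute for the argpartition expression used inside f; the variable names are illustrative only.

```python
import numpy as np

weight = np.array([
    [9.96859395, 8.65543961, 6.07429382, 4.58735497],
    [3.21776471, 8.33560037, 2.11424961, 8.89739975],
    [9.74560314, 5.94640798, 6.10318198, 7.33056421],
    [6.60986206, 2.36877835, 3.06143215, 7.82384351],
    [9.49702267, 9.98664568, 3.89140374, 5.42108704],
    [1.93551346, 8.45768507, 8.60233715, 8.09610975],
    [5.21892795, 4.18786508, 5.82665674, 8.28397111],
])
n_top = 5
N, M = weight.shape

# After partitioning at position N - n_top, that row holds the n_top-th
# largest value of each column, i.e. the per-column threshold.
kth = N - n_top
thresh = np.partition(weight, kth, axis=0)[kth]

# N real rows plus one all-False dummy row at index N.
mask = np.zeros((N + 1, M), dtype=bool)
mask[:N] = weight >= thresh

print(np.flatnonzero(mask[:, 0]))  # [0 2 3 4 6]
```

With distinct values in a column, exactly n_top entries of that column are True, and the dummy row stays False.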

It then uses np_sentence to index into the mask, counts the True values for each pair of np_sentence row and np_weight column, and compares the count with the threshold of two.

The only complication: we must suppress duplicate values within the rows of np_sentence. To that end, we sort the rows and then redirect each index that equals its left neighbor to the dummy row in the mask.
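
The duplicate-suppression step can be seen in isolation on the example sentences. This is a sketch with illustrative names; N = 7 is the index of the dummy row for the 7-row example weights.

```python
import numpy as np

sentence = np.array([[2, 5, 1], [1, 6, 4], [0, 0, 0], [2, 3, 6], [4, 2, 4]])
N = 7  # index of the all-False dummy row in the mask

# Sorting puts equal indices next to each other (np.sort returns a copy).
s = np.sort(sentence, axis=1)

# Any entry equal to its left neighbor is a repeat: send it to the dummy row.
s[:, 1:][s[:, 1:] == s[:, :-1]] = N

print(s.tolist())  # [[1, 2, 5], [1, 4, 6], [0, 7, 7], [2, 3, 6], [2, 4, 7]]
```

The row [0, 0, 0] keeps a single 0 and the row [4, 2, 4] keeps a single 4, so a repeated word index can never count as two hits.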

This function returns a mask. The last line of the script demonstrates how to convert that mask to indices.

import numpy as np

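def f(a1, a2, n_top, n_hit):
    N, M = a1.shape
    # one extra all-False dummy row (index N) absorbs duplicate indices
    mask = np.zeros((N+1, M), dtype=bool)
    # mark entries >= the n_top-th largest value of their column
    np.greater_equal(
        a1, a1[a1.argpartition(N-n_top, axis=0)[N-n_top], np.arange(M)],
        out=mask[:N])
    # sort each row (np.sort returns a copy, so the caller's a2 is untouched)
    a2 = np.sort(a2, axis=1)
    # redirect every index that equals its left neighbor to the dummy row
    a2[:, 1:][a2[:, 1:] == a2[:, :-1]] = N
    # count top-n hits per (sentence row, weight column) and apply the threshold
    return np.count_nonzero(mask[a2], axis=1) >= n_hit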

a1 = np.matrix("""[[9.96859395 8.65543961 6.07429382 4.58735497]
 [3.21776471 8.33560037 2.11424961 8.89739975]
 [9.74560314 5.94640798 6.10318198 7.33056421]
 [6.60986206 2.36877835 3.06143215 7.82384351]
 [9.49702267 9.98664568 3.89140374 5.42108704]
 [1.93551346 8.45768507 8.60233715 8.09610975]
 [5.21892795 4.18786508 5.82665674 8.28397111]]"""[2:-2].replace("]\n [",";")).A

a2 = np.matrix("""[[2 5 1]
 [1 6 4]
 [0 0 0]
 [2 3 6]
 [4 2 4]]"""[2:-2].replace("]\n [",";")).A

print(f(a1,a2,5,2))

from itertools import groupby
from operator import itemgetter

print([[*map(itemgetter(1),grp)] for k,grp in groupby(np.argwhere(f(a1,a2,5,2).T),itemgetter(0))])

Output:

[[False  True  True  True]
 [ True  True  True  True]
 [False False False False]
 [ True False  True  True]
 [ True  True  True False]]
[[1, 3, 4], [0, 1, 4], [0, 1, 3, 4], [0, 1, 3]]
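
As an alternative to the groupby one-liner, the mask could also be converted to per-column index lists with np.flatnonzero. This is only a sketch: the mask is copied from the output above, and the name per_column is hypothetical. Unlike the groupby version, a column without any hit would show up here as an empty list rather than being skipped.

```python
import numpy as np

# Boolean result of f(a1, a2, 5, 2), copied from the output above:
# rows are sentences, columns are the columns of the weight matrix.
result = np.array([
    [False, True, True, True],
    [True, True, True, True],
    [False, False, False, False],
    [True, False, True, True],
    [True, True, True, False],
])

# For each weight column, collect the indices of the matching sentence rows.
per_column = [np.flatnonzero(col).tolist() for col in result.T]

print(per_column)  # [[1, 3, 4], [0, 1, 4], [0, 1, 3, 4], [0, 1, 3]]
```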
8 Comments

Thank you so much for putting in the effort and sharing your idea with me, that is awesome. I need to check it on the real data, which might have 800K rows. I will ask you if I run into anything. Again, many thanks, this is wonderful.
Thanks again. Now that I have checked it, I was wondering whether there is any way to get the indices of np_weight rather than those of the other matrix? Everything is the way I want, except that I also need the indices of the first matrix which have been repeated two by two in the first matrix itself.
I have updated my question with a minor modification. Sorry for asking you to help with the modification; I truly appreciate your help.
I understand this is too many questions for one Stack question, but as they are closely related I am not sure starting a new question would be a good idea. I want to keep your approach so that it is done in the most efficient way. Please let me know if you want me to start a new question for the update I made. Thank you again.
@sariii just make a new question and link this one. I may not be able to answer immediately, but maybe someone else will.