
Dear Stackoverflow users,

My Python script is facing performance issues because I have to crunch 2D tables with more than a billion elements, for potentially hundreds of input files. I have been replacing my nested for loops with NumPy array operations, and in the process I found that numpy.take (which gathers elements according to a set of indices) and numpy.outer (which evaluates all pairwise products between the elements of two 1D arrays) are extraordinarily useful. These functions sped up my code by a factor of several hundred where I could use them.

But there is still one place in my code where I have a problem, and that is where I cluster a 2D array with, say, a billion elements into a 4D array with far fewer elements (a few thousand). Specifically, I have two lists of indices whose length equals the number of rows of the matrix (which is square).

The first list of indices is th_t, the second list is dm_t, and the matrix is p_contact. The 4D array of clustered elements is named rc_p. The clustering procedure is the following nested for loop:

import numpy as np

th_t = [1, 3, 2, 1, 1, 3, 3, 0, 1, 0, 2, 1]
dm_t = [0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0]
n_th = len(set(th_t))
n_dm = len(set(dm_t))
p_contact = np.array([
    [0.0129, 0.0134, 0.0062, 0.0021, 0.0107, 0.0106, 0.0076, 0.0134, 0.0087, 0.0031, 0.0026, 0.0114],
    [0.0123, 0.0021, 0.0033, 0.0120, 0.0099, 0.0125, 0.0001, 0.0018, 0.0030, 0.0059, 0.0038, 0.0125],
    [0.0082, 0.0125, 0.0004, 0.0120, 0.0040, 0.0108, 0.0101, 0.0063, 0.0072, 0.0098, 0.0017, 0.0121],
    [0.0096, 0.0008, 0.0073, 0.0100, 0.0123, 0.0104, 0.0077, 0.0025, 0.0106, 0.0126, 0.0031, 0.0033],
    [0.0112, 0.0091, 0.0134, 0.0002, 0.0129, 0.0081, 0.0087, 0.0036, 0.0102, 0.0002, 0.0019, 0.0131],
    [0.0099, 0.0081, 0.0037, 0.0004, 0.0135, 0.0005, 0.0025, 0.0086, 0.0091, 0.0016, 0.0130, 0.0011],
    [0.0078, 0.0005, 0.0044, 0.0089, 0.0127, 0.0106, 0.0113, 0.0048, 0.0057, 0.0133, 0.0077, 0.0033],
    [0.0017, 0.0010, 0.0048, 0.0052, 0.0113, 0.0066, 0.0133, 0.0092, 0.0020, 0.0125, 0.0011, 0.0023],
    [0.0027, 0.0124, 0.0096, 0.0047, 0.0134, 0.0020, 0.0129, 0.0114, 0.0087, 0.0114, 0.0090, 0.0001],
    [0.0032, 0.0014, 0.0038, 0.0114, 0.0058, 0.0017, 0.0089, 0.0057, 0.0022, 0.0056, 0.0046, 0.0094],
    [0.0033, 0.0020, 0.0042, 0.0040, 0.0110, 0.0016, 0.0100, 0.0014, 0.0087, 0.0123, 0.0004, 0.0031],
    [0.0010, 0.0029, 0.0054, 0.0015, 0.0064, 0.0060, 0.0131, 0.0064, 0.0073, 0.0097, 0.0132, 0.0092]])
n_sg = len(p_contact)
rc_p = np.zeros((n_th, n_dm, n_th, n_dm)) 
for i in range(n_sg):  # n_sg can be about 40000
    for j in range(n_sg):
        rc_p[th_t[i], dm_t[i], th_t[j], dm_t[j]] += p_contact[i, j]

I tried to use various numpy functions to avoid this nested for loop over a billion elements, and I ended up with the following procedure:

import numpy as np
from itertools import product

th_t = [1, 3, 2, 1, 1, 3, 3, 0, 1, 0, 2, 1]
dm_t = [0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0]
n_th = len(set(th_t))
n_dm = len(set(dm_t))
p_contact = np.array([
    [0.0129, 0.0134, 0.0062, 0.0021, 0.0107, 0.0106, 0.0076, 0.0134, 0.0087, 0.0031, 0.0026, 0.0114],
    [0.0123, 0.0021, 0.0033, 0.0120, 0.0099, 0.0125, 0.0001, 0.0018, 0.0030, 0.0059, 0.0038, 0.0125],
    [0.0082, 0.0125, 0.0004, 0.0120, 0.0040, 0.0108, 0.0101, 0.0063, 0.0072, 0.0098, 0.0017, 0.0121],
    [0.0096, 0.0008, 0.0073, 0.0100, 0.0123, 0.0104, 0.0077, 0.0025, 0.0106, 0.0126, 0.0031, 0.0033],
    [0.0112, 0.0091, 0.0134, 0.0002, 0.0129, 0.0081, 0.0087, 0.0036, 0.0102, 0.0002, 0.0019, 0.0131],
    [0.0099, 0.0081, 0.0037, 0.0004, 0.0135, 0.0005, 0.0025, 0.0086, 0.0091, 0.0016, 0.0130, 0.0011],
    [0.0078, 0.0005, 0.0044, 0.0089, 0.0127, 0.0106, 0.0113, 0.0048, 0.0057, 0.0133, 0.0077, 0.0033],
    [0.0017, 0.0010, 0.0048, 0.0052, 0.0113, 0.0066, 0.0133, 0.0092, 0.0020, 0.0125, 0.0011, 0.0023],
    [0.0027, 0.0124, 0.0096, 0.0047, 0.0134, 0.0020, 0.0129, 0.0114, 0.0087, 0.0114, 0.0090, 0.0001],
    [0.0032, 0.0014, 0.0038, 0.0114, 0.0058, 0.0017, 0.0089, 0.0057, 0.0022, 0.0056, 0.0046, 0.0094],
    [0.0033, 0.0020, 0.0042, 0.0040, 0.0110, 0.0016, 0.0100, 0.0014, 0.0087, 0.0123, 0.0004, 0.0031],
    [0.0010, 0.0029, 0.0054, 0.0015, 0.0064, 0.0060, 0.0131, 0.0064, 0.0073, 0.0097, 0.0132, 0.0092]])

#prepare the flattened list of index pairs
th_t        = np.asarray(th_t)
dm_t        = np.asarray(dm_t)
thdm_stack  = np.stack((th_t, dm_t))
thdm_stack  = np.transpose(thdm_stack)
thdm_table  = np.asarray(list(product(thdm_stack, thdm_stack)))
p_contact_f = p_contact.flatten()

#calculate clustered probabilities for each contact type
rc_p                 = np.zeros((n_th, n_dm, n_th, n_dm)) 

for th1 in range(n_th):
    for dm1 in range(n_dm):
        for th2 in range(n_th):
            for dm2 in range(n_dm):
                to_find                       = np.zeros((2, 2))
                to_find[0][0]                 = th1
                to_find[0][1]                 = dm1
                to_find[1][0]                 = th2
                to_find[1][1]                 = dm2
                condition                     = np.isin(thdm_table, to_find)
                condition                     = np.all(condition, axis=(1, 2))
                to_add                        = np.extract(condition, p_contact_f)
                rc_p[th1][dm1][th2][dm2]      = np.sum(to_add)

which ends up being slower than the original procedure, not faster, probably because at each of the few thousand steps of the 4D loop I have to generate and process a billion-element boolean mask (even though the 4D loop itself has thousands of times fewer iterations than the original one).

So, does anyone have an idea of how I could replace this costly nested for loop and take maximum advantage of the underlying C code of NumPy to cluster this big 2D matrix into the much smaller 4D array?

Note that the individual elements of these arrays are probabilities. The total sum of all elements is 1, both in the 2D array and in the 4D clustered array. By "clustering", I mean grouping probabilities into types: all cells of the 2D matrix that share the same set of indices get their probabilities added into a single element of the 4D clustered array.
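To make the intended behavior concrete, here is a small self-contained sketch (with made-up toy data, not the real input) showing the clustering and checking the invariant described above: since every cell of the 2D matrix is added to exactly one cell of the 4D array, the total probability must be preserved.

```python
import numpy as np

# Toy example (hypothetical data): cluster a normalized 4x4 matrix
# into a (2, 2, 2, 2) array and verify that the total sum stays 1.
rng = np.random.default_rng(0)
th_t = np.array([1, 0, 1, 0])
dm_t = np.array([0, 1, 1, 0])
n_th, n_dm = 2, 2

p = rng.random((4, 4))
p /= p.sum()  # normalize so the 2D matrix sums to 1

rc_p = np.zeros((n_th, n_dm, n_th, n_dm))
for i in range(4):
    for j in range(4):
        # each 2D cell lands in exactly one 4D cell
        rc_p[th_t[i], dm_t[i], th_t[j], dm_t[j]] += p[i, j]

assert np.isclose(rc_p.sum(), 1.0)
```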

All the best!

  • What is i_th and i_dm? Commented Sep 27, 2018 at 6:28
  • Can you add a minimal reproducible example? Basically we need some test arrays and an expected output, even if it's way smaller than your actual use case, so that we know what the output should be. Commented Sep 27, 2018 at 6:32
  • Finally, do you need a numpy only solution, or can we use scipy as well? Commented Sep 27, 2018 at 6:34
  • Ok, thanks for the comment. I'm okay with a scipy solution. The only thing I want to do is to accelerate this piece of code, using the possibilities any library provides for projecting Python code onto underlying C code. I will make the example minimal, complete and verifiable right now. Commented Sep 27, 2018 at 7:10
  • Are you tied to python 2.7? Have you looked into numba? Commented Sep 27, 2018 at 7:16

2 Answers


You're not actually iterating over four dimensions; you're iterating over two: i and j. You can use np.ravel_multi_index to combine your th_t and dm_t arrays into a single flat index, reducing the problem to 2D, and reshape back to 4D at the end:

idx = np.ravel_multi_index((th_t, dm_t), (n_th, n_dm))
rc_p = np.zeros((n_th * n_dm, n_th * n_dm))
for i in range(idx.size):
    np.add.at(rc_p[idx[i]], idx, p_contact[i])
rc_p = rc_p.reshape(n_th, n_dm, n_th, n_dm)
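The per-row loop above can even be collapsed into a single scatter-add, a sketch of an alternative (using the question's data): broadcasting the flat index array against itself gives the full 2D index grid, and np.add.at accumulates correctly even when indices repeat.

```python
import numpy as np

th_t = np.array([1, 3, 2, 1, 1, 3, 3, 0, 1, 0, 2, 1])
dm_t = np.array([0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0])
n_th = len(set(th_t.tolist()))
n_dm = len(set(dm_t.tolist()))
# placeholder data standing in for the real p_contact matrix
p_contact = np.arange(144, dtype=float).reshape(12, 12)

# combine the two index lists into one flat index per row/column
idx = np.ravel_multi_index((th_t, dm_t), (n_th, n_dm))

rc_p = np.zeros((n_th * n_dm, n_th * n_dm))
# one unbuffered scatter-add over the whole matrix:
# (idx[:, None], idx[None, :]) broadcasts to a full 12x12 index grid
np.add.at(rc_p, (idx[:, None], idx[None, :]), p_contact)
rc_p = rc_p.reshape(n_th, n_dm, n_th, n_dm)
```

Note that a plain fancy-indexed `+=` would silently drop repeated indices, which is exactly why `np.add.at` is needed here.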

Or, if you can use numba, just wrap your initial loopy code in a @jit decorator, which will compile it to fast machine code:

from numba import jit

@jit
def foo(p_contact, th_t, dm_t, n_th, n_dm):
    n_sg = len(p_contact)
    rc_p = np.zeros((n_th, n_dm, n_th, n_dm))
    for i in range(n_sg):
        for j in range(n_sg):
            rc_p[th_t[i], dm_t[i], th_t[j], dm_t[j]] += p_contact[i, j]
    return rc_p


Thanks! I used the first snippet of code, and it worked (and led to a large gain of time for my code) when I replaced rc_p.reshape(n_th, n_dm, n_th, n_dm) by rc_p = rc_p.reshape(n_th, n_dm, n_th, n_dm) in my script.

I want to emphasize the very useful function proposed by Daniel F, which I did not know and which was key to fixing this issue:

numpy.ravel_multi_index

It transforms sequences of multi-dimensional indices into a single 1D index list. For example, with dimensions (2, 9), the index pair (1, 4) maps to flat index 13 (one full row of 9 plus 4), i.e. the 14th element in row-major order. It is a bit tricky to understand at first, but very powerful.
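A quick illustration of the (2, 9) example above, plus the element-wise use on whole index arrays that makes it useful in the accepted answer:

```python
import numpy as np

# The pair (1, 4) in a (2, 9) grid maps to 1 * 9 + 4 = 13,
# i.e. the 14th element in row-major (C) order.
flat = np.ravel_multi_index((1, 4), (2, 9))
print(flat)  # 13

# It also works element-wise on whole index arrays,
# turning pairs of index lists into one flat index list:
rows = np.array([0, 1, 1])
cols = np.array([4, 0, 8])
print(np.ravel_multi_index((rows, cols), (2, 9)))  # [ 4  9 17]
```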

