I have a list of 2D arrays that all have the same number of rows but different numbers of columns. I need to pad them into a single array in which every sub-array has the same shape. My current code is below, but because of the for loop it is very inefficient for large lists. How can I create the padded array more efficiently?

MWE:

import numpy as np


list_length = 552

# Irregular shape arrays in the list
# list_with_irregular_arrays[0].shape = (10, 40)
# .
# .
# list_with_irregular_arrays[100].shape = (10, 60)
list_with_irregular_arrays = [np.random.rand(10, np.random.randint(30, 70)) for _ in range(list_length)] # NOTE: Only to create example data

# Create padded array
num_rows = 10
num_cols = max(arr.shape[1] for arr in list_with_irregular_arrays)
padded_array = np.full((list_length, num_rows, num_cols), np.nan, dtype=np.float32)

for i in range(list_length):
    arr = list_with_irregular_arrays[i]
    padded_array[i, :, :arr.shape[1]] = arr
  • 2
    What do you want to do with this? Your options are basically to pad with garbage and that'll have issues in most vectorized processing Commented Sep 8 at 18:47
  • 1
    Agreed, the next step is very important to rule out a hidden-question scenario. Commented Sep 8 at 19:18
  • 1
    A big part of the time (35%) is lost in np.full because of page faults, so you should reuse the output array if possible. A smaller part (~15%) is lost in NumPy overheads, so native code (or Numba/Cython) can make this a bit faster. Finally, a non-negligible amount of time is taken by memory accesses, which cannot be avoided here unless you skip creating this (certainly unnecessary) expensive array and instead merge this computation directly with the next ones that use it. If you expect a speed-up of more than 5x, then the latter is certainly mandatory. Commented Sep 8 at 21:53
  • 1
    Besides, the input is 64-bit floats while the output is 32-bit floats. Is this intended? It matters for performance, especially here, since 64-bit floats take twice as long to read from memory (and also require a conversion). Commented Sep 8 at 21:55
  • 1
    You ask for a 2D array, yet you create a 3D array. Would it suit you to just use np.hstack(list_with_irregular_arrays) for your later computations? Commented Sep 9 at 11:14
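
As an illustration of the suggestions in the comments above (use np.hstack and feed the packed data straight into the next computation rather than padding), here is a minimal sketch. It assumes the follow-up step is a per-array reduction such as a row-wise mean, which the question does not confirm. The names `arrays`, `widths`, `offsets`, `block_sums` and `block_means` are made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
# Example data only: 5 arrays with 10 rows and 30 to 69 columns each.
arrays = [rng.random((10, rng.integers(30, 70))) for _ in range(5)]

widths = np.array([a.shape[1] for a in arrays])
# Start column of each original array inside the packed array.
offsets = np.concatenate(([0], np.cumsum(widths)[:-1]))

# Row-packed form: all arrays side by side, shape (10, widths.sum()).
packed = np.hstack(arrays)

# Assumed follow-up step: per-array, per-row sums and means, no padding needed.
# reduceat sums packed[:, offsets[k]:offsets[k + 1]] for each block k.
block_sums = np.add.reduceat(packed, offsets, axis=1)  # shape (10, 5)
block_means = block_sums / widths
```

Column k of `block_means` holds the row means of `arrays[k]`. Whether a given downstream computation can be expressed on the packed form like this depends on what that computation actually is.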

1 Answer

The steps before, during, and after this problem statement are a mix of under-specified and undesirable. Avoid the list representation entirely and use row-packed form: a single 2D array with the original arrays concatenated along the column axis. If you are memory-constrained and can accept 0 as the padding value, you can use CSR instead. The CSR matrix is 2D: its second dimension is the padded width, and its first dimension stacks the arrays, so it is implicitly the product (number of arrays × rows per array).

import numpy as np
import scipy.sparse

n_rows = 3
n_cols = np.array((4, 9, 1))
n_max_cols = n_cols.max()
n_total_cols = n_cols.sum()

rand = np.random.default_rng(seed=0)
packed = rand.random(size=(n_rows, n_total_cols), dtype=np.float32)

print('If you can use this packed form directly (there are many operations that can), then STOP HERE.')
np.set_printoptions(precision=2)
print(packed)
print()

print(
    "You could use this form if 0 is an acceptable padding value "
    "(you haven't responded to specify)."
)
sparse = scipy.sparse.lil_array((n_rows*n_cols.size, n_max_cols))
x = 0
y = 0
# There are vectorised options to construct this as well; this form is easy to understand.
for width in n_cols:
    xnew = x + width
    ynew = y + n_rows
    sparse[y: ynew, 0: width] = packed[:, x: xnew]
    x = xnew
    y = ynew
csr = sparse.tocsr()
print(csr.toarray())

Output:

If you can use this packed form directly (there are many operations that can), then STOP HERE.
[[0.85 0.64 0.51 0.27 0.31 0.04 0.08 0.02 0.18 0.81 0.65 0.91 0.5  0.61]
 [0.97 0.73 0.63 0.54 0.56 0.94 0.28 0.82 0.67 0.   0.39 0.86 0.55 0.03]
 [0.76 0.73 0.85 0.18 0.09 0.86 0.02 0.54 0.08 0.3  0.48 0.42 0.4  0.03]]

You could use this form if 0 is an acceptable padding value (you haven't responded to specify).
[[0.85 0.64 0.51 0.27 0.   0.   0.   0.   0.  ]
 [0.97 0.73 0.63 0.54 0.   0.   0.   0.   0.  ]
 [0.76 0.73 0.85 0.18 0.   0.   0.   0.   0.  ]
 [0.31 0.04 0.08 0.02 0.18 0.81 0.65 0.91 0.5 ]
 [0.56 0.94 0.28 0.82 0.67 0.   0.39 0.86 0.55]
 [0.09 0.86 0.02 0.54 0.08 0.3  0.48 0.42 0.4 ]
 [0.61 0.   0.   0.   0.   0.   0.   0.   0.  ]
 [0.03 0.   0.   0.   0.   0.   0.   0.   0.  ]
 [0.03 0.   0.   0.   0.   0.   0.   0.   0.  ]]
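
If the 3D NaN-padded layout from the question is still required, the row-packed form can be scattered into it with one boolean-mask assignment instead of a Python loop. This is a sketch under assumptions, not a measured improvement: the names `arrays`, `widths`, `mask` and `padded` are invented for the example, and whether it beats the loop depends on the array sizes, so time it on real data.

```python
import numpy as np

rng = np.random.default_rng(0)
# Example data only: 552 arrays with 10 rows and 30 to 69 columns each.
arrays = [rng.random((10, rng.integers(30, 70))) for _ in range(552)]

n_rows = arrays[0].shape[0]
widths = np.array([a.shape[1] for a in arrays])
max_width = widths.max()

# Row-packed form: all arrays side by side, shape (n_rows, widths.sum()).
packed = np.concatenate(arrays, axis=1)

# mask[i, j] is True where column j holds real data for array i.
mask = np.arange(max_width) < widths[:, None]

padded = np.full((len(arrays), n_rows, max_width), np.nan, dtype=np.float32)
# The transpose is a view of shape (n_rows, n_arrays, max_width), so the
# assignment writes into `padded`. The mask selects slots in the same
# array-by-array, column-by-column order in which `packed` was built.
padded.transpose(1, 0, 2)[:, mask] = packed
```

If the data is kept in packed form from the start, the `np.concatenate` step disappears and only the mask assignment remains.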

2 Comments

This doesn't look like what they want. Their result is 3D, but yours is only 2D?
That's true; unfortunately, it is the closest approximation possible with the scipy.sparse module.
