How to efficiently create an array from a list containing arrays of different lengths

Question

I have a list of 2D arrays with the same number of rows but different numbers of columns. I need to pad them to a common shape and stack them into a single array. My current code is below, but because of the for loop it is very inefficient for large lists. How can I create the padded array more efficiently?

MWE:

import numpy as np


list_length = 552

# Irregular shape arrays in the list
# list_with_irregular_arrays[0].shape = (10, 40)
# .
# .
# list_with_irregular_arrays[100].shape = (10, 60)
list_with_irregular_arrays = [np.random.rand(10, np.random.randint(30, 70)) for _ in range(list_length)] # NOTE: Only to create example data

# Create padded array
num_rows = 10
num_cols = max(arr.shape[1] for arr in list_with_irregular_arrays)
padded_array = np.full((list_length, num_rows, num_cols), np.nan, dtype=np.float32)

for i in range(list_length):
    arr = list_with_irregular_arrays[i]
    padded_array[i, :, :arr.shape[1]] = arr

What do you want to do with this? Your options are basically to pad with garbage, and that will cause issues in most vectorized processing.
– roganjosh, Commented Sep 8 at 18:47
Agreed, the next step is very important to rule out a hidden-question scenario.
– Reinderien, Commented Sep 8 at 19:18
A big part of the time (35%) is lost in np.full because of page faults, so you need to reuse the output array if possible. Then a smaller part (~15%) is lost in NumPy overheads, so native code (or Numba/Cython) can make this a bit faster. Finally, a non-negligible amount of time is taken by memory accesses that cannot be avoided here unless you do not actually create this (certainly unnecessary) expensive array, but instead merge this computation directly with the next ones that use it. If you expect a speed-up of more than 5 times, then the latter is certainly mandatory.
– Jérôme Richard, Commented Sep 8 at 21:53
Besides, the input is 64-bit floats while the output is 32-bit floats. Is this intended? It matters for performance, especially here, since 64-bit floats take twice as long to read from memory (and also require a conversion).
– Jérôme Richard, Commented Sep 8 at 21:55
You ask to create a 2D array, though you create a 3D array. Would it suit you to just use np.hstack(list_with_irregular_arrays) for your later computations?
– SLebedev777, Commented Sep 9 at 11:14
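
The two performance points raised in the comments above (reuse the output array, and keep input and output dtypes identical) can be sketched as follows. This is only an illustration, not a benchmarked solution: the helper name `fill_padded` and the variable `buffer` are hypothetical, and the loop over the list is kept as in the question.

```python
import numpy as np


def fill_padded(out, arrays, fill_value=np.nan):
    """Overwrite `out` in place with `arrays`, right-padded with `fill_value`."""
    # Refill the existing buffer instead of calling np.full again; on repeated
    # calls the memory is already mapped, which is the point of the reuse advice.
    out.fill(fill_value)
    for i, arr in enumerate(arrays):
        out[i, :, :arr.shape[1]] = arr
    return out


rng = np.random.default_rng(0)
# Example data generated directly as float32, so no float64 -> float32
# conversion happens during the copy (assumes float32 is really what you want).
arrays = [rng.random((10, rng.integers(30, 70)), dtype=np.float32) for _ in range(552)]

max_cols = max(arr.shape[1] for arr in arrays)
# Allocated once; pass the same buffer to every later call of fill_padded.
buffer = np.empty((len(arrays), 10, max_cols), dtype=np.float32)

padded = fill_padded(buffer, arrays)
```

Any gain depends on how often the padded array is rebuilt; if it is built only once, reusing the buffer changes nothing.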

Reinderien · Accepted Answer · 2025-09-09 15:23:40Z

The steps before, during, and after this problem statement are a mix of under-specified and undesirable. Avoid a list representation entirely and use a row-packed form instead. If you are memory-constrained and can accept 0 as the padding value, you can use CSR (which will have the correct second dimension, with the first dimension being an implicit product of the array count and the row count).

import numpy as np
import scipy.sparse

n_rows = 3
n_cols = np.array((4, 9, 1))
n_max_cols = n_cols.max()
n_total_cols = n_cols.sum()

rand = np.random.default_rng(seed=0)
packed = rand.random(size=(n_rows, n_total_cols), dtype=np.float32)

print('If you can use this packed form directly (there are many operations that can), then STOP HERE.')
np.set_printoptions(precision=2)
print(packed)
print()

print(
    "You could use this form if 0 is an acceptable padding value "
    "(you haven't responded to specify)."
)
sparse = scipy.sparse.lil_array((n_rows*n_cols.size, n_max_cols))
x = 0
y = 0
# There are vectorised options to construct this as well; this form is easy to understand.
for width in n_cols:
    xnew = x + width
    ynew = y + n_rows
    sparse[y: ynew, 0: width] = packed[:, x: xnew]
    x = xnew
    y = ynew
csr = sparse.tocsr()
print(csr.toarray())

If you can use this packed form directly (there are many operations that can), then STOP HERE.
[[0.85 0.64 0.51 0.27 0.31 0.04 0.08 0.02 0.18 0.81 0.65 0.91 0.5  0.61]
 [0.97 0.73 0.63 0.54 0.56 0.94 0.28 0.82 0.67 0.   0.39 0.86 0.55 0.03]
 [0.76 0.73 0.85 0.18 0.09 0.86 0.02 0.54 0.08 0.3  0.48 0.42 0.4  0.03]]

You could use this form if 0 is an acceptable padding value (you haven't responded to specify).
[[0.85 0.64 0.51 0.27 0.   0.   0.   0.   0.  ]
 [0.97 0.73 0.63 0.54 0.   0.   0.   0.   0.  ]
 [0.76 0.73 0.85 0.18 0.   0.   0.   0.   0.  ]
 [0.31 0.04 0.08 0.02 0.18 0.81 0.65 0.91 0.5 ]
 [0.56 0.94 0.28 0.82 0.67 0.   0.39 0.86 0.55]
 [0.09 0.86 0.02 0.54 0.08 0.3  0.48 0.42 0.4 ]
 [0.61 0.   0.   0.   0.   0.   0.   0.   0.  ]
 [0.03 0.   0.   0.   0.   0.   0.   0.   0.  ]
 [0.03 0.   0.   0.   0.   0.   0.   0.   0.  ]]
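
As one illustration of an operation that can work on the packed form directly, here is a minimal sketch of per-array row sums and means using `np.add.reduceat`. It assumes the data starts as a list (so `np.hstack` is used to pack it, as suggested in the question comments) and that every array has at least one column; the names `widths`, `starts`, `row_sums` and `row_means` are made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
widths = np.array([4, 9, 1])
arrays = [rng.random((3, w), dtype=np.float32) for w in widths]

# Row-packed form: all arrays side by side, shape (3, 14)
packed = np.hstack(arrays)

# Column offset at which each original array starts inside `packed`
starts = np.concatenate(([0], np.cumsum(widths)[:-1]))

# Sum each row over each array's own columns; result has one column per array.
# reduceat only behaves like a segmented sum when every width is >= 1.
row_sums = np.add.reduceat(packed, starts, axis=1)

# Per-array row means, computed without ever creating padding
row_means = row_sums / widths
```

No padding value is needed here, so the question of NaN versus 0 does not come up.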

edited Sep 9 at 15:23

answered Sep 9 at 14:16

Reinderien


2 Comments

Kelly Bundy Sep 9 at 14:40

This doesn't look like what they want. Their result is 3D but yours is only 2D?

Reinderien Sep 9 at 15:24

It's true, and unfortunately the closest approximation possible using the scipy sparse module.
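
If the 3D NaN-padded layout from the question really is required, one possible vectorised construction from the packed form uses a boolean mask instead of a Python loop. This is a sketch under the same toy sizes as the answer, not something taken from the answer itself; `mask`, `padded_t` and `padded` are illustrative names.

```python
import numpy as np

n_rows = 3
n_cols = np.array((4, 9, 1))
rng = np.random.default_rng(seed=0)
packed = rng.random(size=(n_rows, n_cols.sum()), dtype=np.float32)

n_arrays = n_cols.size
n_max_cols = n_cols.max()

# mask[i, j] is True where column j of array i holds real data
mask = np.arange(n_max_cols) < n_cols[:, None]  # shape (n_arrays, n_max_cols)

# Build in (array, column, row) order so that the True entries of `mask`,
# read in row-major order, line up one-to-one with the columns of `packed`.
padded_t = np.full((n_arrays, n_max_cols, n_rows), np.nan, dtype=np.float32)
padded_t[mask] = packed.T  # packed.T has shape (n_total_cols, n_rows)

# Reorder axes to (array, row, column), matching the question's layout
padded = padded_t.transpose(0, 2, 1)
```

Note that `padded` is a transposed view rather than a C-contiguous array; `np.ascontiguousarray(padded)` makes a contiguous copy if later code needs one. For a zero-padded result, the dense output of the answer's CSR matrix can also be reshaped with `csr.toarray().reshape(n_arrays, n_rows, n_max_cols)`, since its rows are already grouped by array.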
