
I have an array that was created from lists of varying lengths. I do not know the lengths of the lists beforehand, which is why I was using lists instead of arrays.

Here's a reproducible code for the purpose of this question:

import numpy as np

a = []

for i in np.arange(5):
    a += [np.random.rand(np.random.randint(1, 6))]  # each row gets 1 to 5 random values

a = np.array(a)  # note: NumPy >= 1.24 requires dtype=object for ragged input like this

Is there a more efficient way than the following to convert this into a rectangular 2-D array, with all rows the same length and the short ones padded with NaNs?

max_len_of_array = 0
for aa in a:
    len_of_array = aa.shape[0]
    if len_of_array > max_len_of_array:
        max_len_of_array = len_of_array
max_len_of_array

n = a.shape[0]

A = np.zeros((n, max_len_of_array)) * np.nan
for i, aa in enumerate(zip(a)):
    A[i][:aa[0].shape[0]] = aa[0]

A
  • Can you keep track of max_len_of_array when you are filling the original list? Otherwise your approach seems reasonable. Commented Sep 17, 2017 at 23:48
  • @nalyd88 Yes, it is possible, but I am creating around 10 such arrays. I guess I could use an array for the max_len_of_array values. Commented Sep 17, 2017 at 23:54
  • @DYZ I don't see how this relates to my question. Please clarify if you do. Commented Sep 18, 2017 at 0:03
  • Here's a related question. Commented Sep 18, 2017 at 0:32
  • Just as a warning, around here we often use "structured array" for an array with a compound dtype. What you want is a NaN-padded rectangular or regular numeric (float) array. Without the padding, np.array(yourlist) would produce a 1-D object-dtype array (an "irregular" or ragged array); see the snippet after these comments. Commented Sep 18, 2017 at 3:21
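
To make that last comment concrete, here is a minimal sketch (mine, not from the original thread) of what the un-padded list actually becomes:

import numpy as np

# A list of unequal-length arrays collapses to a 1-D object array,
# not a 2-D float array. dtype=object is required on NumPy >= 1.24;
# older versions inferred it (with a deprecation warning since 1.20).
ragged = [np.ones(2), np.ones(4)]
obj = np.array(ragged, dtype=object)
print(obj.shape, obj.dtype)  # (2,) object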

2 Answers


Here is a slightly faster version of your code:

def alt(a):
    A = np.full((len(a), max(map(len, a))), np.nan)
    for i, aa in enumerate(a):
        A[i, :len(aa)] = aa
    return A

The for-loops are unavoidable. Given that a is essentially a Python list of arrays with different lengths, there is no getting around iterating through its items. Sometimes the loop can be hidden (behind calls to max and map, for instance), but speed-wise such hidden loops are essentially equivalent to explicit Python loops.
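
For completeness, here is a loop-light variant (my own sketch, not part of the original answer) that writes all rows in a single boolean-mask assignment. The iteration does not disappear; it just hides inside the list comprehension and np.concatenate:

def alt_mask(a):
    lens = np.array([len(aa) for aa in a])        # length of each ragged row
    out = np.full((len(a), lens.max()), np.nan)   # NaN-filled target
    mask = np.arange(lens.max()) < lens[:, None]  # True for the first lens[i] columns of row i
    out[mask] = np.concatenate(a)                 # boolean assignment fills row-major, preserving row order
    return out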


Here is a benchmark using a with resultant shape (100, 100):

In [197]: %timeit orig(a)
10000 loops, best of 3: 125 µs per loop

In [198]: %timeit alt(a)
10000 loops, best of 3: 84.1 µs per loop

In [199]: %timeit using_pandas(a)
100 loops, best of 3: 4.8 ms per loop
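
Outside of IPython, roughly the same comparison can be reproduced with the standard timeit module (a sketch that assumes the setup functions defined below; absolute timings will vary by machine):

import timeit

a = make_array(100, 100)
for fn in (orig, alt, using_pandas):
    t = timeit.timeit(lambda: fn(a), number=1000)
    print(f"{fn.__name__}: {t / 1000 * 1e6:.1f} µs per call")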

This was the setup used for the benchmark:

import numpy as np
import pandas as pd

def make_array(h, w):
    # build h rows, each a random 1-D array of length 1..w
    a = []
    for i in np.arange(h):
        a += [np.random.rand(np.random.randint(1, w + 1))]
    a = np.array(a)  # NumPy >= 1.24 requires dtype=object here for ragged input
    return a

def orig(a):
    max_len_of_array = 0

    for aa in a:
        len_of_array = aa.shape[0]
        if len_of_array > max_len_of_array:
            max_len_of_array = len_of_array

    n = a.shape[0]

    A = np.zeros((n, max_len_of_array)) * np.nan
    for i, aa in enumerate(zip(a)):
        A[i][:aa[0].shape[0]] = aa[0]

    return A

def alt(a):
    A = np.full((len(a), max(map(len, a))), np.nan)
    for i, aa in enumerate(a):
        A[i, :len(aa)] = aa
    return A

def using_pandas(a):
    return pd.DataFrame.from_records(a).values

a = make_array(100, 100)
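
As a quick sanity check (my addition, not in the original answer), the two implementations can be compared directly; np.testing.assert_array_equal treats NaNs in matching positions as equal:

A1 = orig(a)
A2 = alt(a)
assert A1.shape == A2.shape
np.testing.assert_array_equal(A1, A2)  # passes: same values, NaNs in the same places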



I suppose you can use pandas as a one-time solution, but it's going to be very inefficient, like most things in pandas:

pd.DataFrame(a)[0].apply(pd.Series).values
#array([[ 0.28669545,  0.22080038,  0.32727194],
#       [ 0.17892276,         nan,         nan],
#       [ 0.26853548,         nan,         nan],
#       [ 0.86460043,  0.78827094,  0.96660502],
#       [ 0.41045599,         nan,         nan]])

  • That seems to be another possible solution, but as you indicate it is not efficient, at least not more efficient than the loop: 870 microseconds for pandas versus 7.1 microseconds for the loop.
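
A pandas-free alternative (my own sketch, not from the thread) achieves the same padding with itertools.zip_longest:

import numpy as np
from itertools import zip_longest

# zip_longest(*a) iterates the rows in parallel, padding the short ones
# with NaN; the result comes out transposed, hence the final .T
A = np.array(list(zip_longest(*a, fillvalue=np.nan))).T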
