
I'm doing some operations on a Pandas dataframe. For a certain column, I need to convert each cell to a NumPy array, which is not hard. The end goal is to get a 2D array as a result from the whole column. However, when I perform the following operation, I get a 1D array, and the inner arrays are not recognized.

import numpy as np
import pandas as pd

df = pd.DataFrame({'col': ['abc', 'def']})
mapping = {v: k for k, v in enumerate('abcdef')}
df['new'] = df['col'].apply(lambda x: list(x))
df['new'].apply(lambda x: np.array([mapping[i] for i in x])).values

This gives:

array([array([0, 1, 2]), array([3, 4, 5])], dtype=object)

and the shape is (2,), meaning the inner arrays are not recognized.

If I call .reshape(2, -1) on that result, I get shape (2, 1) instead of (2, 3).
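In runnable form, calling the result `s` (the reshape can't help because NumPy sees a length-2 array of opaque objects, so the -1 resolves to 1):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col': ['abc', 'def']})
mapping = {v: k for k, v in enumerate('abcdef')}
df['new'] = df['col'].apply(list)

# An object array of arrays, shape (2,), not a 2D numeric array.
s = df['new'].apply(lambda x: np.array([mapping[i] for i in x])).values

print(s.shape)                 # (2,)
print(s.reshape(2, -1).shape)  # (2, 1) — the -1 resolves to 1, not 3
```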

Appreciate any help!


Clarification:

The above is only a toy example. What I was doing was preprocessing for machine learning using the IMDB dataset. I had to convert each value in a review column to a word embedding which is a numpy array. Now the challenge is to get all these arrays out as a 2D array, so that I can use them in my machine learning model.
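To make the real use case concrete, here is a minimal sketch of that setup (the embedding table, the word vectors, and the mean-pooling step are all invented for illustration; the real embeddings would come from a trained model):

```python
import numpy as np
import pandas as pd

# Hypothetical embedding table: each word maps to a 3-dimensional vector.
embeddings = {'good': np.array([0.1, 0.2, 0.3]),
              'movie': np.array([0.4, 0.5, 0.6])}

df = pd.DataFrame({'review': [['good', 'movie'], ['movie', 'good']]})

# One fixed-length vector per review (here: the mean of the word vectors).
df['vec'] = df['review'].apply(
    lambda ws: np.mean([embeddings[w] for w in ws], axis=0))

arr = df['vec'].values
print(arr.shape)  # (2,) — an object array of arrays, not the desired (2, 3)
```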

  • np.array(df['new'].values.tolist()) or np.stack(df['new']) Commented Jan 16, 2019 at 21:01
  • @user3483203 tolist() will mean it's no longer an array Commented Jan 16, 2019 at 21:03
  • @roganjosh not sure what you mean. If you leave out the tolist, you will get an array of type object with a shape of (2,) Commented Jan 16, 2019 at 21:05
  • @user3483203 but still a numpy array, that you can try (if in a suitable state) to convert the type of. tolist() drops it out to a python list, which you're just going to convert back to an array? You could just leave it at .values? Or am I missing something Commented Jan 16, 2019 at 21:06
  • @George are you looking for a nested array within a pandas cell? Commented Jan 16, 2019 at 21:07

2 Answers


I think it would be better to create an array from the list values directly.

 df
   col        new
0  abc  [a, b, c]
1  def  [d, e, f]

arr = np.array(df['new'].tolist())
arr
# array([['a', 'b', 'c'],
#        ['d', 'e', 'f']], dtype='<U1')

arr.shape
# (2, 3)

Big disclaimer: This will work only if the sublists all have the same number of elements. If not, it will mean they are ragged arrays, and numpy will not be able to use an efficient memory format for representing your array (hence, the dtype='object').
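A hypothetical ragged case illustrates the disclaimer (note that recent NumPy versions, 1.24 and later, raise an error for ragged input unless you pass dtype=object explicitly):

```python
import numpy as np

# Sublists of unequal length: no rectangular array is possible.
ragged = [['a', 'b', 'c'], ['d', 'e']]

arr = np.array(ragged, dtype=object)  # explicit dtype needed on NumPy >= 1.24
print(arr.shape)  # (2,) — a 1D array of Python lists, not a 2D array
print(arr.dtype)  # object
```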


5 Comments

concatenate (or one of its stack relatives) will treat a 1d object array as a list, and attempt to join the subarrays into one.
@hpaulj I'm guessing it would be a lot slower though, since concatenate is working with an object array, right?
We need to do some timings :)
@hpaulj Does np.stack(df[['new']].values, axis=1) give you a 2D array of shape (2,3)? It doesn't seem to work, they remain lists for me.
axis=0 is the version that replicates np.array.
In [2]: import pandas as pd
In [3]: df = pd.DataFrame({'col': ['abc', 'def']})
   ...: mapping = {v: k for k, v in enumerate('abcdef')}
   ...: df['new'] = df['col'].apply(lambda x: list(x))

In [7]: df['new']
Out[7]: 
0    [a, b, c]
1    [d, e, f]
Name: new, dtype: object
In [8]: df['new'].values
Out[8]: array([list(['a', 'b', 'c']), list(['d', 'e', 'f'])], dtype=object)

np.stack behaves a lot like np.array, joining the elements on a new initial axis:

In [9]: np.stack(df['new'].values)
Out[9]: 
array([['a', 'b', 'c'],
       ['d', 'e', 'f']], dtype='<U1')

or on another axis of your choice:

In [10]: np.stack(df['new'].values, axis=1)
Out[10]: 
array([['a', 'd'],
       ['b', 'e'],
       ['c', 'f']], dtype='<U1')

np.array also works if the object array is turned into a list (as @coldspeed shows):

In [11]: df['new'].values.tolist()
Out[11]: [['a', 'b', 'c'], ['d', 'e', 'f']]
In [12]: np.array(df['new'].values.tolist())
Out[12]: 
array([['a', 'b', 'c'],
       ['d', 'e', 'f']], dtype='<U1')

As for speed, let's make a bigger array:

In [16]: arr = np.frompyfunc(lambda x: np.arange(1000),1,1)(np.arange(1000))
In [17]: arr.shape
Out[17]: (1000,)
In [18]: np.stack(arr).shape
Out[18]: (1000, 1000)
In [20]: np.array(arr.tolist()).shape
Out[20]: (1000, 1000)

In [21]: timeit np.stack(arr).shape
5.24 ms ± 190 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [22]: timeit np.array(arr.tolist()).shape
4.45 ms ± 138 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Basically the same, with a slight edge to the np.array approach.

stack, like vstack, expands the dimensions of each element as needed. Skipping that with concatenate is a bit faster:

In [27]: timeit np.concatenate(arr).reshape(-1,1000).shape
4.04 ms ± 12.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

This arr contains arrays. If it contained lists instead, the array(arr.tolist()) approach would do better (relatively), since it has only one list (of lists) to convert to an array. The stack approach has to first convert each of the sublists into arrays.
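That distinction can be sketched with a toy object array holding lists rather than arrays (both routes still reach the same (3, 3) result; they differ only in how much per-element conversion work they do):

```python
import numpy as np

# Build a 1D object array whose elements are Python *lists*.
# (Direct slice assignment of equal-length sublists would broadcast,
# so we fill it element by element.)
arr_of_lists = np.empty(3, dtype=object)
for i, lst in enumerate([[0, 1, 2], [3, 4, 5], [6, 7, 8]]):
    arr_of_lists[i] = lst

a = np.array(arr_of_lists.tolist())  # one nested list -> one conversion
b = np.stack(arr_of_lists)           # each sublist converted to an array first

print(a.shape, b.shape)  # (3, 3) (3, 3)
```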

