
I'm doing some operations on a Pandas dataframe. For a certain column, I need to convert each cell to a NumPy array, which is not hard. The end goal is to get a 2D array as a result from the whole column. However, when I perform the following operation, I get a 1D array, and the inner arrays are not recognized.

import numpy as np
import pandas as pd

df = pd.DataFrame({'col': ['abc', 'def']})
mapping = {v: k for k, v in enumerate('abcdef')}
df['new'] = df['col'].apply(lambda x: list(x))
df['new'].apply(lambda x: np.array([mapping[i] for i in x])).values

This gives:

array([array([0, 1, 2]), array([3, 4, 5])], dtype=object)

and the shape is (2,), meaning the inner arrays are not recognized.

If I call .reshape(2, -1) on that result, I get shape (2, 1) instead of (2, 3).
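In runnable form, calling the result `s` (the reshape can't help because NumPy sees a length-2 array of opaque objects, so the -1 resolves to 1):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col': ['abc', 'def']})
mapping = {v: k for k, v in enumerate('abcdef')}
df['new'] = df['col'].apply(list)

# An object array of arrays, shape (2,), not a 2D numeric array.
s = df['new'].apply(lambda x: np.array([mapping[i] for i in x])).values

print(s.shape)                 # (2,)
print(s.reshape(2, -1).shape)  # (2, 1) — the -1 resolves to 1, not 3
```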

Appreciate any help!


Clarification:

The above is only a toy example. What I was doing was preprocessing for machine learning using the IMDB dataset. I had to convert each value in a review column to a word embedding which is a numpy array. Now the challenge is to get all these arrays out as a 2D array, so that I can use them in my machine learning model.
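To make the real use case concrete, here is a minimal sketch of that setup (the embedding table, the word vectors, and the mean-pooling step are all invented for illustration; the real embeddings would come from a trained model):

```python
import numpy as np
import pandas as pd

# Hypothetical embedding table: each word maps to a 3-dimensional vector.
embeddings = {'good': np.array([0.1, 0.2, 0.3]),
              'movie': np.array([0.4, 0.5, 0.6])}

df = pd.DataFrame({'review': [['good', 'movie'], ['movie', 'good']]})

# One fixed-length vector per review (here: the mean of the word vectors).
df['vec'] = df['review'].apply(
    lambda ws: np.mean([embeddings[w] for w in ws], axis=0))

arr = df['vec'].values
print(arr.shape)  # (2,) — an object array of arrays, not the desired (2, 3)
```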

  • np.array(df['new'].values.tolist()) or np.stack(df['new']) Commented Jan 16, 2019 at 21:01
  • @user3483203 tolist() will mean it's no longer an array Commented Jan 16, 2019 at 21:03
  • @roganjosh not sure what you mean. If you leave out the tolist, you will get an array of type object with a shape of (2,) Commented Jan 16, 2019 at 21:05
  • @user3483203 but still a numpy array, that you can try (if in a suitable state) to convert the type of. tolist() drops it out to a python list, which you're just going to convert back to an array? You could just leave it at .values? Or am I missing something Commented Jan 16, 2019 at 21:06
  • @George are you looking for a nested array within a pandas cell? Commented Jan 16, 2019 at 21:07

2 Answers


I think it would be better to create an array from the list values directly.

 df
   col        new
0  abc  [a, b, c]
1  def  [d, e, f]

arr = np.array(df['new'].tolist())
arr
# array([['a', 'b', 'c'],
#        ['d', 'e', 'f']], dtype='<U1')

arr.shape
# (2, 3)

Big disclaimer: This will work only if the sublists all have the same number of elements. If not, it will mean they are ragged arrays, and numpy will not be able to use an efficient memory format for representing your array (hence, the dtype='object').
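A hypothetical ragged case illustrates the disclaimer (note that recent NumPy versions, 1.24 and later, raise an error for ragged input unless you pass dtype=object explicitly):

```python
import numpy as np

# Sublists of unequal length: no rectangular array is possible.
ragged = [['a', 'b', 'c'], ['d', 'e']]

arr = np.array(ragged, dtype=object)  # explicit dtype needed on NumPy >= 1.24
print(arr.shape)  # (2,) — a 1D array of Python lists, not a 2D array
print(arr.dtype)  # object
```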


5 Comments

concatenate (or one of its stack relatives) will treat a 1d object array as a list, and attempt to join the subarrays into one.
@hpaulj I'm guessing it would be a lot slower though, since concatenate is working with an object array, right?
We need to do some timings :)
@hpaulj Does np.stack(df[['new']].values, axis=1) give you a 2D array of shape (2,3)? It doesn't seem to work, they remain lists for me.
axis=0 is the version that replicates np.array.
In [2]: import pandas as pd
In [3]: df = pd.DataFrame({'col': ['abc', 'def']})
   ...: mapping = {v: k for k, v in enumerate('abcdef')}
   ...: df['new'] = df['col'].apply(lambda x: list(x))

In [7]: df['new']
Out[7]: 
0    [a, b, c]
1    [d, e, f]
Name: new, dtype: object
In [8]: df['new'].values
Out[8]: array([list(['a', 'b', 'c']), list(['d', 'e', 'f'])], dtype=object)

np.stack behaves a lot like np.array, joining the elements on a new initial axis:

In [9]: np.stack(df['new'].values)
Out[9]: 
array([['a', 'b', 'c'],
       ['d', 'e', 'f']], dtype='<U1')

or on another axis of your choice:

In [10]: np.stack(df['new'].values, axis=1)
Out[10]: 
array([['a', 'd'],
       ['b', 'e'],
       ['c', 'f']], dtype='<U1')

np.array also works if the object array is turned into a list (as @coldspeed shows):

In [11]: df['new'].values.tolist()
Out[11]: [['a', 'b', 'c'], ['d', 'e', 'f']]
In [12]: np.array(df['new'].values.tolist())
Out[12]: 
array([['a', 'b', 'c'],
       ['d', 'e', 'f']], dtype='<U1')

As for speed, let's make a bigger array:

In [16]: arr = np.frompyfunc(lambda x: np.arange(1000),1,1)(np.arange(1000))
In [17]: arr.shape
Out[17]: (1000,)
In [18]: np.stack(arr).shape
Out[18]: (1000, 1000)
In [20]: np.array(arr.tolist()).shape
Out[20]: (1000, 1000)

In [21]: timeit np.stack(arr).shape
5.24 ms ± 190 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [22]: timeit np.array(arr.tolist()).shape
4.45 ms ± 138 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Basically the same, with a slight edge to the np.array approach.

stack, like vstack, expands the dimensions of each element as needed. Skipping that with concatenate is a bit faster:

In [27]: timeit np.concatenate(arr).reshape(-1,1000).shape
4.04 ms ± 12.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

This arr contains arrays. If it contained lists instead, the array(arr.tolist()) approach would do better (relatively), since it has only one list (of lists) to convert to an array. The stack approach has to first convert each of the sublists into arrays.
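That distinction can be sketched with a toy object array holding lists rather than arrays (both routes still reach the same (3, 3) result; they differ only in how much per-element conversion work they do):

```python
import numpy as np

# Build a 1D object array whose elements are Python *lists*.
# (Direct slice assignment of equal-length sublists would broadcast,
# so we fill it element by element.)
arr_of_lists = np.empty(3, dtype=object)
for i, lst in enumerate([[0, 1, 2], [3, 4, 5], [6, 7, 8]]):
    arr_of_lists[i] = lst

a = np.array(arr_of_lists.tolist())  # one nested list -> one conversion
b = np.stack(arr_of_lists)           # each sublist converted to an array first

print(a.shape, b.shape)  # (3, 3) (3, 3)
```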

