
Let's say that I have the following data frame:

import pandas as pd

df_raw = pd.DataFrame({"person_id": [101, 101, 102, 102, 102, 103],
                       "date": [0, 5, 0, 7, 11, 0],
                       "val1": [99, 11, 22, 33, 44, 22],
                       "val2": [77, 88, 22, 66, 55, 33]})

What I want to achieve is to create a 3-dimensional numpy array such that the result is the following:

np_pros = np.array([[[0, 99, 77], [5, 11, 88]], [[0, 22, 22], [7, 33, 66], [11, 44, 55]], [[0, 22, 33]]])

In other words, the result should have the shape [unique_ids, None, feature_size]. In my case, the number of unique_ids is 3, the feature size is 3 (all columns except person_id), and the second dimension is of variable length: it indicates the number of measurements for each person_id. (Strictly speaking, numpy cannot store such a ragged structure as a regular ndarray, so the result is a list, or object array, of 2D arrays.)

I am well aware that I can create an np.zeros((unique_ids, max_num_measurements, feature_size)) array, populate it, and then delete the elements I don't need, but I want something faster. The reason is that my actual dataframe is huge (roughly [50000, 455]), which would result in a numpy array of roughly [12500, 200, 455].
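For reference, here is a minimal sketch of that pre-allocate-and-trim baseline (the names counts, max_len, and padded are illustrative):

import numpy as np

counts = df_raw.groupby("person_id").size()   # measurements per person
max_len = counts.max()                        # longest sequence
feature_size = df_raw.shape[1] - 1            # every column except person_id

# pre-allocate the rectangular array, then fill it group by group
padded = np.zeros((counts.size, max_len, feature_size))
for i, (_, group) in enumerate(df_raw.groupby("person_id")):
    padded[i, :len(group)] = group.drop("person_id", axis=1).values

# trim the padding back off to recover the ragged structure
ragged = [padded[i, :n] for i, n in enumerate(counts)]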

Looking forward to your answers!

  • I don't think you can create an array like that; each of the inner arrays has a different size (the group size). You could have a list, however. Commented Jan 10, 2019 at 14:08
  • @DanielMesejo so what do you suggest? What would be optimal in both memory and complexity? Commented Jan 10, 2019 at 14:10
  • What do you want to do afterwards? Commented Jan 10, 2019 at 14:13
  • That's a good question. After I have the sequences, I want to perform bucketing with TensorFlow to dynamically pad the sequences. Commented Jan 10, 2019 at 14:15
  • That's why I strictly want to end up with a variable-length array (to pad afterwards within a batch; see the sketch after these comments). Commented Jan 10, 2019 at 14:16
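For context, here is a minimal sketch of the within-batch padding mentioned in these comments, assuming TensorFlow 2's tf.data API (the batch size and dtype are illustrative, and sequences stands in for the list of per-person arrays built in the answers below):

import numpy as np
import tensorflow as tf

# `sequences` stands in for the list of variable-length per-person arrays
sequences = [np.array([[0, 99, 77], [5, 11, 88]], dtype=np.float32),
             np.array([[0, 22, 22], [7, 33, 66], [11, 44, 55]], dtype=np.float32),
             np.array([[0, 22, 33]], dtype=np.float32)]

dataset = tf.data.Dataset.from_generator(
    lambda: iter(sequences),
    output_signature=tf.TensorSpec(shape=(None, 3), dtype=tf.float32))

# pad each batch only to the length of its longest member
for batch in dataset.padded_batch(batch_size=2):
    print(batch.shape)  # (2, 3, 3), then (1, 1, 3)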

3 Answers


Here's one way to do it:

import numpy as np

ix = np.flatnonzero(df_raw.person_id != df_raw.person_id.shift(1))
np.split(df_raw.drop('person_id', axis=1).values, ix[1:])

[array([[ 0, 99, 77],
        [ 5, 11, 88]], dtype=int64), 
 array([[ 0, 22, 22],
        [ 7, 33, 66],
        [11, 44, 55]], dtype=int64), 
 array([[ 0, 22, 33]], dtype=int64)]

Details

Use np.flatnonzero after comparing df_raw.person_id with a shifted version of itself (Series.shift) in order to get the indices where changes in person_id take place:

ix = np.flatnonzero(df_raw.person_id != df_raw.person_id.shift(1))
# array([0, 2, 5])

Use np.split to split the dataframe's columns of interest at the obtained indices:

np.split(df_raw.drop('person_id', axis=1).values, ix[1:])

which yields the list of per-person arrays shown above.
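One caveat worth noting: the shift-and-split trick assumes that rows sharing a person_id are contiguous, as in the example data. If that is not guaranteed, a stable sort restores the invariant first (a sketch):

# mergesort is stable, so the within-person row order is preserved
df_raw = df_raw.sort_values('person_id', kind='mergesort')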


You could use groupby:

import pandas as pd

df_raw = pd.DataFrame({"person_id": [101, 101, 102, 102, 102, 103],
                       "date": [0, 5, 0, 7, 11, 0],
                       "val1": [99, 11, 22, 33, 44, 22],
                       "val2": [77, 88, 22, 66, 55, 33]})

result = [group.values for _, group in df_raw.groupby('person_id')[['date', 'val1', 'val2']]]
print(result)

Output

[array([[ 0, 99, 77],
       [ 5, 11, 88]]), array([[ 0, 22, 22],
       [ 7, 33, 66],
       [11, 44, 55]]), array([[ 0, 22, 33]])]
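Since groupby collects rows by key, this variant also works when rows for the same person_id are not contiguous. If it helps downstream to know which person_id each block belongs to, the same idea can be keyed by id (the name arrays_by_id is illustrative):

arrays_by_id = {pid: group.values
                for pid, group in df_raw.groupby('person_id')[['date', 'val1', 'val2']]}
print(arrays_by_id[102])
# [[ 0 22 22]
#  [ 7 33 66]
#  [11 44 55]]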



Another solution, with xarray.

Let's create the dimension implied by the repetition of person_id:

>>> df = df_raw.copy()  # work on a copy of the question's frame
>>> df['newdim'] = df.person_id.duplicated()
>>> df['newdim'] = df.groupby('person_id').newdim.cumsum()
>>> df = df.set_index(["newdim", "person_id"])
>>> df
                  date  val1  val2
newdim person_id                  
0.0    101           0    99    77
1.0    101           5    11    88
0.0    102           0    22    22
1.0    102           7    33    66
2.0    102          11    44    55
0.0    103           0    22    33

For the sake of readability, we may want to turn df into an xarray.Dataset object:

>>> xa = df.to_xarray()
>>> xa
<xarray.Dataset>
Dimensions:    (newdim: 3, person_id: 3)
Coordinates:
  * newdim     (newdim) float64 0.0 1.0 2.0
  * person_id  (person_id) int64 101 102 103
Data variables:
    date       (newdim, person_id) float64 0.0 0.0 0.0 5.0 7.0 nan nan 11.0 nan
    val1       (newdim, person_id) float64 99.0 22.0 22.0 11.0 33.0 nan nan ...
    val2       (newdim, person_id) float64 77.0 22.0 33.0 88.0 66.0 nan nan ...

and then into a dimensionally healthy numpy array:

>>> ar = xa.to_array().T.values
>>> ar
array([[[ 0., 99., 77.],
        [ 5., 11., 88.],
        [nan, nan, nan]],

       [[ 0., 22., 22.],
        [ 7., 33., 66.],
        [11., 44., 55.]],

       [[ 0., 22., 33.],
        [nan, nan, nan],
        [nan, nan, nan]]])

Note that NaN values have been introduced by the coercion to a rectangular array, wherever a person has fewer measurements than the longest sequence.
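If the ragged structure is needed downstream, the NaN padding can be stripped back out, for example (a sketch):

import numpy as np

# keep only the rows of each 2D block that contain no NaN padding
ragged = [block[~np.isnan(block).any(axis=1)] for block in ar]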
