
Let's say that I have the following data frame:

import pandas as pd

df_raw = pd.DataFrame({"person_id": [101, 101, 102, 102, 102, 103],
                       "date": [0, 5, 0, 7, 11, 0],
                       "val1": [99, 11, 22, 33, 44, 22],
                       "val2": [77, 88, 22, 66, 55, 33]})

What I want to achieve is to create a 3-dimensional numpy array such that the result is the following:

np_pros = np.array([[[0, 99, 77], [5, 11, 88]], [[0, 22, 22], [7, 33, 66], [11, 44, 55]], [[0, 22, 33]]])

In other words, the result should have the shape [unique_ids, None, feature_size]. In my case, the number of unique_ids is 3, the feature size is 3 (all columns except person_id), and the second dimension is of variable length: it indicates the number of measurements for each person_id. (Strictly speaking, numpy cannot store such a ragged structure as a regular ndarray, so the result is a list, or object array, of 2D arrays.)

I am well aware that I can create an np.zeros((unique_ids, max_num_measurements, feature_size)) array, populate it, and then delete the elements I don't need, but I want something faster. The reason is that my actual dataframe is huge (roughly [50000, 455]), which would result in a numpy array of roughly [12500, 200, 455].
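For reference, here is a minimal sketch of that pre-allocate-and-trim baseline (the names counts, max_len, and padded are illustrative):

import numpy as np

counts = df_raw.groupby("person_id").size()   # measurements per person
max_len = counts.max()                        # longest sequence
feature_size = df_raw.shape[1] - 1            # every column except person_id

# pre-allocate the rectangular array, then fill it group by group
padded = np.zeros((counts.size, max_len, feature_size))
for i, (_, group) in enumerate(df_raw.groupby("person_id")):
    padded[i, :len(group)] = group.drop("person_id", axis=1).values

# trim the padding back off to recover the ragged structure
ragged = [padded[i, :n] for i, n in enumerate(counts)]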

Looking forward to your answers!

  • I don't think you can create an array like that; each of the inner arrays has a different size (the group size). You could have a list, however. Commented Jan 10, 2019 at 14:08
  • @DanielMesejo so what do you suggest? What would be optimal in both memory and complexity? Commented Jan 10, 2019 at 14:10
  • What do you want to do afterwards? Commented Jan 10, 2019 at 14:13
  • That's a good question. After I have the sequences, I want to perform bucketing with TensorFlow to dynamically pad the sequences. Commented Jan 10, 2019 at 14:15
  • That's why I strictly want to end up with a variable-length array (to pad afterwards within a batch; see the sketch after these comments). Commented Jan 10, 2019 at 14:16
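For context, here is a minimal sketch of the within-batch padding mentioned in these comments, assuming TensorFlow 2's tf.data API (the batch size and dtype are illustrative, and sequences stands in for the list of per-person arrays built in the answers below):

import numpy as np
import tensorflow as tf

# `sequences` stands in for the list of variable-length per-person arrays
sequences = [np.array([[0, 99, 77], [5, 11, 88]], dtype=np.float32),
             np.array([[0, 22, 22], [7, 33, 66], [11, 44, 55]], dtype=np.float32),
             np.array([[0, 22, 33]], dtype=np.float32)]

dataset = tf.data.Dataset.from_generator(
    lambda: iter(sequences),
    output_signature=tf.TensorSpec(shape=(None, 3), dtype=tf.float32))

# pad each batch only to the length of its longest member
for batch in dataset.padded_batch(batch_size=2):
    print(batch.shape)  # (2, 3, 3), then (1, 1, 3)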

3 Answers


Here's one way to do it:

import numpy as np

ix = np.flatnonzero(df_raw.person_id != df_raw.person_id.shift(1))
np.split(df_raw.drop('person_id', axis=1).values, ix[1:])

[array([[ 0, 99, 77],
        [ 5, 11, 88]], dtype=int64), 
 array([[ 0, 22, 22],
        [ 7, 33, 66],
        [11, 44, 55]], dtype=int64), 
 array([[ 0, 22, 33]], dtype=int64)]

Details

Use np.flatnonzero after comparing df_raw.person_id with a shifted version of itself (Series.shift) in order to get the indices where changes in person_id take place:

ix = np.flatnonzero(df_raw.person_id != df_raw.person_id.shift(1))
# array([0, 2, 5])

Use np.split to split the dataframe's columns of interest at the obtained indices:

np.split(df_raw.drop('person_id', axis=1).values, ix[1:])

which yields the list of per-person arrays shown above.
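One caveat worth noting: the shift-and-split trick assumes that rows sharing a person_id are contiguous, as in the example data. If that is not guaranteed, a stable sort restores the invariant first (a sketch):

# mergesort is stable, so the within-person row order is preserved
df_raw = df_raw.sort_values('person_id', kind='mergesort')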


You could use groupby:

import pandas as pd

df_raw = pd.DataFrame({"person_id": [101, 101, 102, 102, 102, 103],
                       "date": [0, 5, 0, 7, 11, 0],
                       "val1": [99, 11, 22, 33, 44, 22],
                       "val2": [77, 88, 22, 66, 55, 33]})

result = [group.values for _, group in df_raw.groupby('person_id')[['date', 'val1', 'val2']]]
print(result)

Output

[array([[ 0, 99, 77],
       [ 5, 11, 88]]), array([[ 0, 22, 22],
       [ 7, 33, 66],
       [11, 44, 55]]), array([[ 0, 22, 33]])]
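Since groupby collects rows by key, this variant also works when rows for the same person_id are not contiguous. If it helps downstream to know which person_id each block belongs to, the same idea can be keyed by id (the name arrays_by_id is illustrative):

arrays_by_id = {pid: group.values
                for pid, group in df_raw.groupby('person_id')[['date', 'val1', 'val2']]}
print(arrays_by_id[102])
# [[ 0 22 22]
#  [ 7 33 66]
#  [11 44 55]]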



Another solution, with xarray.

Let's create the dimension implied by the repetition of person_id:

>>> df = df_raw.copy()  # work on a copy of the question's frame
>>> df['newdim'] = df.person_id.duplicated()
>>> df['newdim'] = df.groupby('person_id').newdim.cumsum()
>>> df = df.set_index(["newdim", "person_id"])
>>> df
                  date  val1  val2
newdim person_id                  
0.0    101           0    99    77
1.0    101           5    11    88
0.0    102           0    22    22
1.0    102           7    33    66
2.0    102          11    44    55
0.0    103           0    22    33

For the sake of readability, we may want to turn df into an xarray.Dataset object:

>>> xa = df.to_xarray()
>>> xa
<xarray.Dataset>
Dimensions:    (newdim: 3, person_id: 3)
Coordinates:
  * newdim     (newdim) float64 0.0 1.0 2.0
  * person_id  (person_id) int64 101 102 103
Data variables:
    date       (newdim, person_id) float64 0.0 0.0 0.0 5.0 7.0 nan nan 11.0 nan
    val1       (newdim, person_id) float64 99.0 22.0 22.0 11.0 33.0 nan nan ...
    val2       (newdim, person_id) float64 77.0 22.0 33.0 88.0 66.0 nan nan ...

and then into a dimensionally healthy numpy array:

>>> ar = xa.to_array().T.values
>>> ar
array([[[ 0., 99., 77.],
        [ 5., 11., 88.],
        [nan, nan, nan]],

       [[ 0., 22., 22.],
        [ 7., 33., 66.],
        [11., 44., 55.]],

       [[ 0., 22., 33.],
        [nan, nan, nan],
        [nan, nan, nan]]])

Note that NaN values have been introduced by the coercion to a rectangular array, wherever a person has fewer measurements than the longest sequence.
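If the ragged structure is needed downstream, the NaN padding can be stripped back out, for example (a sketch):

import numpy as np

# keep only the rows of each 2D block that contain no NaN padding
ragged = [block[~np.isnan(block).any(axis=1)] for block in ar]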
