
I'm working with a dataframe that contains multiple columns. My goal is to create one extra column containing a list of the values from those columns, and then explode the dataframe on that new column.

This is the original dataset:

         id  day_a1  day_a2  ...   day_a6
13804  002n    25.0    25.0  ...     25.0
30842  002c    30.0    30.0  ...     30.0
1624   002k    25.0     NaN  ...     25.0
8959   002j    25.0    25.0  ...     25.0
21216  003t    25.0    25.0  ...     25.0

I use df['vector'] = df[['day_a1','day_a2','day_a3','day_a4','day_a5','day_a6']].astype(str).apply(lambda x: ','.join(x), axis=1) to create this extra column, which should be a list of all the values from the day columns 1 to 6.

print(df['vector']) returns the following output:

13804    25.0,25.0,24.0,25.0,25.0,25.0
30842    30.0,30.0,31.0,28.0,31.0,30.0
1624         25.0,nan,nan,nan,nan,25.0
8959     25.0,25.0,25.0,25.0,25.0,25.0

That string is not being interpreted as a list, so if I try new_df = df.explode('vector') nothing happens.
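Just to show what I mean by "nothing happens": explode treats a plain string as a scalar, so the row count doesn't change. A minimal check on a small made-up frame (not my real data):

import pandas as pd

tmp = pd.DataFrame({'id': ['002n', '002c'],
                    'vector': ['25.0,25.0', '30.0,30.0']})
# 'vector' holds plain strings, so explode has nothing list-like to expand
print(len(tmp.explode('vector')))  # still 2 rows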

I've also tried the following to convert the vector column into a list:

def listing(row):
    # build a list from the row's 'vector' value (a string at this point)
    val = list(row['vector'])
    return val

df['vector_b'] = df.apply(listing, axis=1)

But that doesn't work either, because each row's 'vector' value is a string, so the list is being created as:

13804    [2, 5, ., 0, ,, 2, 5, ., 0, ,, 2, 4, ., 0, ,, ...
30842    [3, 0, ., 0, ,, 3, 0, ., 0, ,, 3, 1, ., 0, ,, ...
1624     [2, 5, ., 0, ,, n, a, n, ,, n, a, n, ,, n, a, ...
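In other words, list() is just iterating over the joined string character by character, which you can reproduce in plain Python:

s = '25.0,nan,25.0'
print(list(s))  # ['2', '5', '.', '0', ',', 'n', 'a', 'n', ',', '2', '5', '.', '0']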

How can I create an extra column with the values of the columns day_a1 to day_a6 that will be interpreted as a list, so that I can later use explode on it?

  • I've also tried using ast.literal_eval() in a custom function, but it didn't work; it returned an error.
  • I need to use .astype(str) before applying the lambda, otherwise I get an error saying a string was expected but a float was received.

Thanks.

The expected output would be this:

         id  vector  
13804  002n    25.0 
13804  002n    25.0
       ....    ....
13804  002n    25.0
30842  002c    30.0
30842  002c    30.0
  ...   ...     ...
30842  002c    30.0
1624   002k    25.0
1624   002k     NaN
 ...    ...     ...
1624   002k    25.0
  • Instead of astype(str).apply(','.join, axis=1) do astype(str).apply(list, axis=1)? (see the sketch after these comments) Commented Nov 15, 2019 at 20:09
  • I'll try it right now. Commented Nov 15, 2019 at 20:10
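A minimal sketch of the comment's suggestion, assuming the same six day_a columns; the astype(str) is dropped here so the values stay as floats and NaN stays NaN, which matches the expected output above:

day_cols = ['day_a1', 'day_a2', 'day_a3', 'day_a4', 'day_a5', 'day_a6']
# build a real Python list per row instead of a comma-joined string
df['vector'] = df[day_cols].apply(list, axis=1)
# explode now expands each list into one row per value
new_df = df[['id', 'vector']].explode('vector')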

2 Answers


On second thought, this might work better for you:

df.set_index('id', append=True).stack()

Output:

       id          
13804  002n  day_a1    25.0
             day_a2    25.0
             day_a6    25.0
30842  002c  day_a1    30.0
             day_a2    30.0
             day_a6    30.0
1624   002k  day_a1    25.0
             day_a6    25.0
8959   002j  day_a1    25.0
             day_a2    25.0
             day_a6    25.0
21216  003t  day_a1    25.0
             day_a2    25.0
             day_a6    25.0
dtype: float64
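If you need exactly the two-column id/vector layout from the question, here is a sketch of one way to finish from here, assuming df holds only the id and the six day_aX columns; dropna=False keeps the NaN rows that stack would otherwise drop:

# stack into a long Series, keeping NaN values
long_s = df.set_index('id', append=True).stack(dropna=False)
# drop the day_aX level, then turn the Series back into id / vector columns
out = long_s.reset_index(level=2, drop=True).reset_index('id', name='vector')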


  • Yes, both apply(list, axis=1) and this answer produce the expected output, thanks! I'll mark it as the answer as soon as I can.
  • @IvanLibedinsky this is the recommended approach as it is vectorized; it would perform much faster than apply followed by explode.

You could also do:

df[['day_a1','day_a2','day_a3','day_a4','day_a5','day_a6']].apply(lambda x: x.tolist(), axis=1)
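For example, assigning that result to a column and exploding it (a small sketch, keeping only id next to the new column):

day_cols = ['day_a1', 'day_a2', 'day_a3', 'day_a4', 'day_a5', 'day_a6']
df['vector'] = df[day_cols].apply(lambda x: x.tolist(), axis=1)
# each cell of 'vector' is now a real list, so explode expands it row by row
new_df = df[['id', 'vector']].explode('vector')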
