Split lists in a dataframe with different length lists in columns and rows

Question

So my question is similar to this one. Mine is different because I have different length lists on the same row and columns.

Many of the solutions I have tried produce a very long dataframe with multiple repeats. My requirements are row wise meaning if a row has a list it is split into the required number of rows but it does not lead to multiple repeats. Please see below example.

Input example

import pandas as pd
df = pd.DataFrame(
{'C1': [["A","B"], ["C"], ["D","E"], ["F"]],
 'C2': [[1], [2], [3], [4]],
 'C3': ['s1', 's2', 's3', 's4'],
 'C4': [123, 321, [777,111], 145]})

df

Desired output example

I have been playing around with explode(), reset_index(), drop() and more but have not been able to get anything to give the correct output yet.

One thing I tried was this

df = df.explode("C1").reset_index().drop("index",1).explode("C4").reset_index().drop("index",1)

But output was wrong

Henry Ecker · Accepted Answer · 2021-09-20 23:01:58Z

It seems that the exploded columns and the non-exploded columns need to be separated. Since we can't hide them in the index as we normally do (given C2) contains lists (which are unhashable) we must separate the DataFrame then rejoin.

# Convert to single series to explode
cols = ['C1', 'C4']
new_df = df[cols].stack().explode().to_frame()
# Enumerate groups then unstack
new_df = new_df.set_index(
    new_df.groupby(level=[0, 1]).cumcount(),
    append=True
).unstack(1).groupby(level=0).ffill()

# Join Back Unaffected columns
new_df = new_df.droplevel(0, axis=1).droplevel(1, axis=0).join(
    df[df.columns.symmetric_difference(cols)]
)
# Re order columns and reset index
new_df = new_df.reindex(df.columns, axis=1).reset_index(drop=True)

new_df:

  C1   C2  C3   C4
0  A  [1]  s1  123
1  B  [1]  s1  123
2  C  [2]  s2  321
3  D  [3]  s3  777
4  E  [3]  s3  111
5  F  [4]  s4  145

We stack to get all values into a single series then explode together and convert back to_frame

cols = ['C1', 'C4']
new_df = df[cols].stack().explode().to_frame()

new_df

We can create a new index by enumerating groups with groupby cumcount set_index and unstacking:

new_df = new_df.set_index(
    new_df.groupby(level=[0, 1]).cumcount(),
    append=True
).unstack(1)

     0     
    C1   C4
0 0  A  123
  1  B  NaN
1 0  C  321
2 0  D  777
  1  E  111
3 0  F  145

We can then groupby ffill within index groups:

new_df = new_df.groupby(level=0).ffill()

new_df:

     0     
    C1   C4
0 0  A  123
  1  B  123
1 0  C  321
2 0  D  777
  1  E  111
3 0  F  145

We can then join back the unaffected columns to the DataFrame and reindex to reorder them the way the initially appeared also droplevel to remove unneeded index levels, lastly reset_index:

# Join Back Unaffected columns
new_df = new_df.droplevel(0, axis=1).droplevel(1, axis=0).join(
    df[df.columns.symmetric_difference(cols)]
)
# Re order columns and reset index
new_df = new_df.reindex(df.columns, axis=1).reset_index(drop=True)

new_df:

  C1   C2  C3   C4
0  A  [1]  s1  123
1  B  [1]  s1  123
2  C  [2]  s2  321
3  D  [3]  s3  777
4  E  [3]  s3  111
5  F  [4]  s4  145

It's a bit cumbersome. But like you mentioned in my 1.3.0 update post, this is not a well supported operation in pandas.

wwnde · Accepted Answer · 2021-09-21 00:09:01Z

-1

 df=df.explode('C4').assign(C1=df['C1'].str.join(',').str.split(',')).explode('C1')#Explode to expand dataframe
m=df.duplicated(subset='C1', keep=False)#loc select the duplicated

df.loc[m,'C4']=df.loc[m,'C4'].shift(1)#Introduce Nan

df.dropna().drop_duplicates(subset='C1', keep='last')#clean dataframe

Output

   C1   C2  C3   C4
0  A  [1]  s1  123
0  B  [1]  s1  123
1  C  [2]  s2  321
2  D  [3]  s3  777
2  E  [3]  s3  111
3  F  [4]  s4  145

edited Sep 21, 2021 at 0:09

answered Sep 20, 2021 at 22:53

wwnde

26.7k6 gold badges22 silver badges38 bronze badges

2 Comments

Derek O Over a year ago

OP doesn't want repeated values in column C1

Cam Over a year ago

Looks like this one also works. thanks

Collectives™ on Stack Overflow

Split lists in a dataframe with different length lists in columns and rows

2 Answers 2

1 Comment

2 Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

1 Comment

2 Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related