0

I am struggling with trying to split my dataframe into 3 new dataframes where the split occurs when the supplier name changes. I have searched for existing questions. How to split dataframe on based on columns row and Pandas & python: split dataframe into many dataframes based on column value containing substring came close but I cannot get the output that I want.

Here is attached a toy dataset to illustrate my problem:

df = pd.DataFrame({'Supplier': ['Supplier1', 'Supplier1', 'Supplier2', 'Supplier2', 'Supplier2', 'Supplier3','Supplier3'], 'Class' : ['A', 'A','A','A','A','B','B']})

and I tried (unsuccessfully)

df1 = df.iloc[:df.index[df['Supplier'] == 'Supplier1'].tolist()[0]]
df2 = df.iloc[df.index[df['Supplier'] == 'Supplier2'].tolist()[0]+1:]
df3 = df.iloc[df.index[df['Supplier'] == 'Supplier3'].tolist()[0]+1:]

The result that I am trying to achieve is:

   Supplier Class
0  Supplier1     A
1  Supplier1     A
    Supplier Class
0  Supplier2     A
1  Supplier2     A
2  Supplier2     A
    Supplier Class
0  Supplier3     B
1  Supplier3     B

Any help on this would be greatly appreciated. Thank you!

UPDATE: Using:

df1 = {i:group for i,group in df.groupby( df['Supplier'].ne(df['Supplier'].shift()).cumsum() )}

gives:

{1:     Supplier Class
0  Supplier1     A
1  Supplier1     A, 2:     Supplier Class
2  Supplier2     A
3  Supplier2     A
4  Supplier2     A, 3:     Supplier Class
5  Supplier3     B
6  Supplier3     B}

I need to split into separated dataframe so I did:

df3 = pd.DataFrame.from_dict({i:group for i,group in df1.groupby(df1['Supplier'].ne(df1['Supplier'].shift()).cumsum() )},orient='index', columns= ['Class'])

but it gives the error

 df3 = pd.DataFrame.from_dict({i:group for i,group in df1.groupby(df1['Supplier'].ne(df1['Supplier'].shift()).cumsum() )},orient='index', columns= ['Class'])
AttributeError: 'dict' object has no attribute 'groupby'

3 Answers 3

3

to create dataframe for unique supplier values:

dict(zip(df.groupby('Supplier')))

to create dataframe every time the value in the supplier column changes:

dfs = {i:group.reset_index(drop=True) 
       for i,group in df.groupby( df['Supplier'].ne(df['Supplier'].shift()).cumsum() )}

Update

obtaining three separate dataframes is incompatible with wanting to use pd.DataFrame(..) in the end, this will obviously create a unique dataframe, therefore my solution was to create a dataframe dictionary where each one is accessed for an integer value of 1 to n. We can reset the index for each of them, simply by doing:

{i:group.reset_index(drop=True) for i,group in df.groupby( df['supplier'].ne(df['supplier'].shift()).cumsum() )}

we can obtain a single data frame with the index restored every time there is a change in value in the provider column using pd.concat as @anky_91 suggested

dfs_concat = pd.concat([group.reset_index(drop=True) 
                        for _,group in df.groupby( df['Supplier'].ne(df['Supplier'].shift())
                                                                 .cumsum() )])
print(dfs_concat)

    Supplier Class
0  Supplier1     A
1  Supplier1     A
0  Supplier2     A
1  Supplier2     A
2  Supplier2     A
0  Supplier3     B
1  Supplier3     B

but if the latter is the solution sought we could simply use groupby.cumcount

df.index = df.groupby(df['Supplier'].ne(df['Supplier'].shift()).cumsum()).cumcount()
print(df)

  Supplier Class
0  Supplier1     A
1  Supplier1     A
0  Supplier2     A
1  Supplier2     A
2  Supplier2     A
0  Supplier3     B
1  Supplier3     B
Sign up to request clarification or add additional context in comments.

3 Comments

this second does the work! but gives a dictionary, pd.DataFrame.from_dict does not work
could you show the output of pd.DataFrame.from_dict editing your question? thanks !:)
another one matching the index which OP shows is pd.concat([i.reset_index(drop=True) for _,i in df.groupby('Supplier')]) , but not sure why do they have the columns for each iteration as a row
0

I believe this achieves the splitting you want:

groups = [group.reset_index()[['Supplier', 'Class']] for _, group in df.groupby('Supplier')]

You can get the exact output of your example with

for group in groups:
    print(group)

Output:

    Supplier Class
0  Supplier1     A
1  Supplier1     A
    Supplier Class
0  Supplier2     A
1  Supplier2     A
2  Supplier2     A
    Supplier Class
0  Supplier3     B
1  Supplier3     B

1 Comment

This does not provide any new ideas, nor does it give the desired output because you have not set drop = True
0

Try this,

df = pd.DataFrame({'Supplier': ['Supplier1', 'Supplier1', 'Supplier2', 'Supplier2', 'Supplier2', 'Supplier3','Supplier3'], 'Class' : ['A', 'A','A','A','A','B','B']})


df1 = df[df.Supplier=='Supplier1']
df2 = df[df.Supplier=='Supplier2']
df3 = df[df.Supplier=='Supplier3']

Or you can do,

new_df=df.pivot(columns='Supplier')

to get many columns if you have 'suppliers'.

Output:

Supplier Supplier1 Supplier2 Supplier3
0                A       NaN       NaN
1                A       NaN       NaN
2              NaN         A       NaN
3              NaN         A       NaN
4              NaN         A       NaN
5              NaN       NaN         B
6              NaN       NaN         B

1 Comment

imagine that there are 100 supplier values

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.