0

I'm new to Python and pandas. Could you advise me on how to perform the following manipulation on a DataFrame? I have DataFrame_1:

  id id_name  revenue
0  a  name_a       65
1  a  name_b       65
2  a  name_a       70
3  a  name_b       70
4  a  name_a      121
5  a  name_b      121

and I want to produce the following DataFrame_2:

  id           id_name  revenue
0  a    name_a, name_b       65
1  a    name_a, name_b       70
2  a    name_a, name_b      121

and then produce DataFrame_3:

  id id_name1 id_name2  revenue
0  a   name_a   name_b       65
1  a   name_a   name_b       70
2  a   name_a   name_b      121

So, in the first step I want to combine the 'id_name' strings of rows that share the same 'revenue', and in the second step split the combined 'id_name' column into separate columns.

  • What's with the id variable? If you're only grouping on Revenue what do you want to happen in the case that the Revenue is the same but id is different? Commented May 13, 2018 at 22:07

3 Answers

2

Use groupby and cumcount to create an additional key, then unstack on that key:

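# occurrence number of each row within its (id, id_name) group
s = df.groupby(['id', 'id_name']).cumcount()
# position (1, 2, ...) of each name among the rows sharing that occurrence number
df['NewId'] = s.groupby(s).cumcount() + 1
df.set_index(['id', 'revenue', 'NewId'])['id_name'].unstack().add_prefix('id_name').reset_index()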
Out[137]: 
NewId id  revenue id_name1 id_name2
0      a       65   name_a   name_b
1      a       70   name_a   name_b
2      a      121   name_a   name_b
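
The same reshaping can also be written with the position key computed directly per (id, revenue) group. The following is only a minimal sketch, assuming the data from the question; the names `pos` and `wide` are illustrative and not part of the answer above.

```python
import pandas as pd

# DataFrame_1 from the question
df = pd.DataFrame({
    'id': ['a'] * 6,
    'id_name': ['name_a', 'name_b'] * 3,
    'revenue': [65, 65, 70, 70, 121, 121],
})

# Number the names inside each (id, revenue) group: 1, 2, ...
pos = df.groupby(['id', 'revenue']).cumcount() + 1

wide = (
    df.assign(pos=pos)
      .set_index(['id', 'revenue', 'pos'])['id_name']
      .unstack()                 # one column per position
      .add_prefix('id_name')     # 1 -> id_name1, 2 -> id_name2
      .reset_index()
)
wide.columns.name = None         # drop the leftover 'pos' label on the column axis
print(wide)
```

Keying on (id, revenue) also keeps rows apart when the same revenue appears under different ids, which is the case raised in the comment on the question.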

2

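This is one solution. The first part is identical to @ALollz's answer, but the second part uses a list comprehension after calculating the maximum number of id_names per group. Note that the printed result comes from input in which the revenue 121 group has a third name, name_c, which is not in the question's data.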

# groupby to list of id_names
df2 = df.groupby(['id', 'revenue'])['id_name'].apply(list).reset_index()

# copy df2
df3 = df2.copy()

# calculate max number of id_names
lens = max(map(len, df3['id_name'].values))

# split columns
df3[['id_name'+str(i) for i in range(1, lens+1)]] = df2['id_name'].apply(pd.Series)

# drop unsplit column
df3 = df3.drop('id_name', axis=1)

print(df3)

  id  revenue id_name1 id_name2 id_name3
0  a       65   name_a   name_b      NaN
1  a       70   name_a   name_b      NaN
2  a      121   name_a   name_b   name_c

1

You can get most of the way to the second DataFrame with groupby (the names end up in lists rather than in comma-joined strings):

df2 = df1.groupby(['id', 'revenue']).id_name.apply(list).reset_index()

  id  revenue           id_name
0  a       65  [name_a, name_b]
1  a       70  [name_a, name_b]
2  a      121  [name_a, name_b]
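
If the comma-joined strings shown in the question's DataFrame_2 are wanted instead of lists, one possible approach is to aggregate with `', '.join` and later split the string again. This is a hedged sketch that assumes the question's data and that no name contains the separator `', '`; the variable `parts` is illustrative.

```python
import pandas as pd

# DataFrame_1 from the question
df1 = pd.DataFrame({
    'id': ['a'] * 6,
    'id_name': ['name_a', 'name_b'] * 3,
    'revenue': [65, 65, 70, 70, 121, 121],
})

# DataFrame_2: one row per (id, revenue), names joined into a single string
df2 = df1.groupby(['id', 'revenue'])['id_name'].agg(', '.join).reset_index()

# DataFrame_3: split the joined string back into one column per name
parts = df2['id_name'].str.split(', ', expand=True)
parts.columns = [f'id_name{i + 1}' for i in parts.columns]
df3 = pd.concat([df2[['id', 'revenue']], parts], axis=1)

print(df2)
print(df3)
```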

For the third DataFrame you can apply pandas.Series to the lists created above. Here's a solution where you don't need to know in advance how many columns you'll end up with. Note that the rename step only covers the first 10 columns; any columns beyond that keep their integer labels.

import pandas as pd
import numpy as np

df3 = pd.concat([df2[['id', 'revenue']], df2['id_name'].apply(pd.Series)], axis=1)
df3.rename(columns=dict((item, 'id_name'+str(item+1)) for item in np.arange(0,10,1)), inplace=True)

  id  revenue id_name1 id_name2
0  a       65   name_a   name_b
1  a       70   name_a   name_b
2  a      121   name_a   name_b
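
To avoid the fixed limit of 10 in the rename, the column names can be generated from however many columns the lists actually produce. The sketch below is an assumption-laden illustration: it starts from a hand-built `df2` of lists, and the third name in the last row is hypothetical data added only to show how shorter lists are padded.

```python
import pandas as pd

# Lists per (id, revenue); the extra 'name_c' is illustrative, not from the question
df2 = pd.DataFrame({
    'id': ['a', 'a', 'a'],
    'revenue': [65, 70, 121],
    'id_name': [['name_a', 'name_b'],
                ['name_a', 'name_b'],
                ['name_a', 'name_b', 'name_c']],
})

# Build the wide part straight from the lists; shorter lists are padded with missing values
names = pd.DataFrame(df2['id_name'].tolist(), index=df2.index)

# Name every resulting column, however many there are
names.columns = [f'id_name{i + 1}' for i in range(names.shape[1])]

df3 = pd.concat([df2[['id', 'revenue']], names], axis=1)
print(df3)
```

Constructing the frame from `tolist()` also avoids calling `pd.Series` once per row, which tends to be slow on large inputs.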
