How to split a single column of a pandas DataFrame into multiple columns by group?

Question

I am new to pandas. I have a DataFrame like the one below:

import pandas as pd

df = pd.DataFrame({'Name': ['football', 'ramesh', 'suresh', 'pankaj', 'cricket', 'rakesh', 'mohit', 'mahesh'],
                   'age': ['25', '22', '21', '32', '37', '26', '24', '30']})
print(df)

       Name age
0  football  25
1    ramesh  22
2    suresh  21
3    pankaj  32
4   cricket  37
5    rakesh  26
6     mohit  24
7    mahesh  30

The "Name" column contains both sport names and the names of the sportspeople. I want to split it into two separate columns, like below:

Expected Output:

sports_name sport_person_name age
football    ramesh            22
            suresh            21
            pankaj            32
cricket     rakesh            26
            mohit             24
            mahesh            30

If I group by the "Name" column I don't get the expected output, which is unsurprising because the "Name" column has no duplicates. What should I use to get the expected output?

Edit: If you don't want to hardcode the sport names:

import numpy as np
import pandas as pd

df = pd.DataFrame({'Name': ['football', 'ramesh', 'suresh', 'pankaj', 'cricket', 'rakesh', 'mohit', 'mahesh'],
                   'age': ['', '22', '21', '32', '', '26', '24', '30']})

# treat empty strings as missing values (a plain replace; no regex is needed)
df = df.replace('', np.nan)

# rows with a missing value outside "Name" are the sport rows
nan_rows = df[df.isnull().any(axis=1)]
sports = nan_rows['Name'].tolist()

df['sports_name'] = df['Name'].where(df['Name'].isin(sports)).ffill()
d = {'Name': 'sport_person_name'}
df = df[df['sports_name'] != df['Name']].reset_index(drop=True).rename(columns=d)
df = df[['sports_name', 'sport_person_name', 'age']]
print(df)

I simply checked which rows contain NaN in all the columns other than "Name"; those rows are certainly sport names. I built a list of those sport names and used the solutions below to create the sports_name and sport_person_name columns.
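
With many columns, the same check can be written without naming any of them: a row is treated as a sport row when every column other than "Name" is empty. This is only a minimal sketch, assuming the sport rows really are blank in all other columns. The extra `city` column and the names `other_cols`, `is_sport` and `result` are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Name': ['football', 'ramesh', 'suresh', 'pankaj',
             'cricket', 'rakesh', 'mohit', 'mahesh'],
    'age': ['', '22', '21', '32', '', '26', '24', '30'],
    # hypothetical extra column, only to show that several columns are handled
    'city': ['', 'Pune', 'Delhi', '', '', 'Surat', 'Agra', 'Kochi'],
})

# Treat empty strings as missing values
df = df.replace('', np.nan)

# A sport row is one where every column except Name is missing
other_cols = df.columns.difference(['Name'])
is_sport = df[other_cols].isna().all(axis=1)

# Keep the sport on its own row, then carry it down to the rows below
df['sports_name'] = df['Name'].where(is_sport).ffill()

# Drop the sport rows and rename the remaining Name values
result = (df[~is_sport]
          .rename(columns={'Name': 'sport_person_name'})
          .reset_index(drop=True))
result = result[['sports_name', 'sport_person_name', 'age', 'city']]
print(result)
```

Using `all(axis=1)` rather than `any(axis=1)` means a player with a single missing value (pankaj's `city` here) is not mistaken for a sport.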

jezrael · Accepted Answer · 2017-09-11 06:53:12Z

2

You can use:

#define list of sports
sports = ['football','cricket']
#create NaNs if no sport in Name, forward filling NaNs
df['sports_name'] = df['Name'].where(df['Name'].isin(sports)).ffill()
#remove same values in columns sports_name and Name, rename column
d = {'Name':'sport_person_name'}
df = df[df['sports_name'] != df['Name']].reset_index(drop=True).rename(columns=d)
#change order of columns
df = df[['sports_name','sport_person_name','age']]
print (df)
  sports_name sport_person_name age
0    football            ramesh  22
1    football            suresh  21
2    football            pankaj  32
3     cricket            rakesh  26
4     cricket             mohit  24
5     cricket            mahesh  30

A similar solution uses DataFrame.insert, so reordering the columns is not necessary:

#define list of sports
sports = ['football','cricket']
#rename column by dict
d = {'Name':'sport_person_name'}
df = df.rename(columns=d)
#create NaNs if no sport in Name, forward filling NaNs
df.insert(0, 'sports_name', df['sport_person_name'].where(df['sport_person_name'].isin(sports)).ffill())
#remove same values in columns sports_name and Name
df = df[df['sports_name'] != df['sport_person_name']].reset_index(drop=True)
print (df)
  sports_name sport_person_name age
0    football            ramesh  22
1    football            suresh  21
2    football            pankaj  32
3     cricket            rakesh  26
4     cricket             mohit  24
5     cricket            mahesh  30

If you want each sport to appear only once, add limit=1 to ffill and replace the remaining NaNs with empty strings:

sports = ['football','cricket']
df['sports_name'] = df['Name'].where(df['Name'].isin(sports)).ffill(limit=1).fillna('')
d = {'Name':'sport_person_name'}
df = df[df['sports_name'] != df['Name']].reset_index(drop=True).rename(columns=d)
df = df[['sports_name','sport_person_name','age']]
print (df)
  sports_name sport_person_name age
0    football            ramesh  22
1                        suresh  21
2                        pankaj  32
3     cricket            rakesh  26
4                         mohit  24
5                        mahesh  30
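
Another way to get the grouped look without blanking any values is to move both name columns into the index, since pandas by default prints a repeated outer index label only once. This is a sketch rather than part of the original answer, assuming the hard-coded `sports` list used above; `shown` is an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['football', 'ramesh', 'suresh', 'pankaj',
             'cricket', 'rakesh', 'mohit', 'mahesh'],
    'age': ['25', '22', '21', '32', '37', '26', '24', '30'],
})
sports = ['football', 'cricket']

# Same forward-fill step as above, keeping the sport on every row
df['sports_name'] = df['Name'].where(df['Name'].isin(sports)).ffill()
df = (df[df['sports_name'] != df['Name']]
      .rename(columns={'Name': 'sport_person_name'}))

# The data still hold the sport on every row; only the printed form hides repeats
shown = df.set_index(['sports_name', 'sport_person_name'])
print(shown)

# Rows for one sport can still be selected through the outer index level
print(shown.xs('football', level='sports_name'))
```

Because the sport stays in the data, later grouping or filtering by `sports_name` keeps working, which is not the case once the repeats are replaced with empty strings.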

edited Sep 11, 2017 at 6:53

answered Sep 11, 2017 at 5:58

jezrael


8 Comments

ketan Over a year ago

@jezrael- Thanks for your answer. What if I don't want to hardcode the sport names in the code? If they change dynamically, what options do we have?

jezrael Over a year ago

Hmm, that is problematic. What is the logic for telling sports apart from names?

ketan Over a year ago

Yes, see my actual question. It's a pivot table. Give me a hint if you have any idea about reading pivot tables in pandas. stackoverflow.com/questions/46154843/…?

jezrael Over a year ago

Maybe one possible solution is to create a big list of all existing sports from Wikipedia, but I am not sure it would match all your data.

ketan Over a year ago

@jezrael- No, that is not productive. It will not work for me.

Drza loren · Accepted Answer · 2017-09-11 06:56:16Z

1

The output you want is a dictionary, not a DataFrame. The dictionary would look like this:

{'Sport' : {'Player' : age,'Player2' : age}}
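
A dictionary of that shape could be built along these lines. This is a rough sketch, assuming the sport names are known in advance and each sport row comes before its players; `nested` is an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['football', 'ramesh', 'suresh', 'pankaj',
             'cricket', 'rakesh', 'mohit', 'mahesh'],
    'age': ['25', '22', '21', '32', '37', '26', '24', '30'],
})
sports = ['football', 'cricket']  # assumed to be known up front

nested = {}
current_sport = None
for name, age in zip(df['Name'], df['age']):
    if name in sports:
        # A sport row starts a new inner dictionary
        current_sport = name
        nested[current_sport] = {}
    else:
        # A player row is filed under the most recent sport
        nested[current_sport][name] = age

print(nested)
```

The ages stay strings because that is how they are stored in the example DataFrame.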

If you really want a DataFrame, and the sport name always comes before its players:

import pandas as pd

df = pd.DataFrame({'Name': ['football', 'ramesh', 'suresh', 'pankaj',
                            'cricket', 'rakesh', 'mohit', 'mahesh'],
                   'age': ['25', '22', '21', '32', '37', '26', '24', '30']})

sports = ['football', 'cricket']
wanted_dict = {}
current_sport = ''

# Walk down the column, remembering the most recent sport seen
for val in df['Name']:
    if val in sports:
        current_sport = val
    else:
        wanted_dict[val] = current_sport

# Now you have {name: sport_name, ...}

# Sport rows are not keys of the dict, so they map to NaN and are dropped
df['sports_name'] = df['Name'].map(wanted_dict)
df = df.dropna(subset=['sports_name']).rename(columns={'Name': 'sport_person_name'})
df = df[['sports_name', 'sport_person_name', 'age']].reset_index(drop=True)

What it should look like:

sports_name sport_person_name age
football    ramesh            22
football    suresh            21
football    pankaj            32
cricket     rakesh            26
cricket     mohit             24
cricket     mahesh            30

answered Sep 11, 2017 at 6:56

Drza loren


5 Comments

ketan Over a year ago

@Drzaloren- Thanks for your answer. What if I don't want to hardcode the sport names in the code? If they change dynamically, what options do we have?

Drza loren Over a year ago

Well, if each sport is followed by the same number of people, maybe you can use the index to create the sports list. I see you have an age column, so what is written there for a sport? If the value for a sport is NaN, you can try using that.

ketan Over a year ago

@Drzaloren- Yes, you are correct. See this corrected DataFrame: df = pd.DataFrame({'Name': ['football', 'ramesh','suresh','pankaj','cricket','rakesh','mohit','mahesh'], 'age': ['', '22','21','32','','26','24','30']}). But don't hardcode the condition that the age column is null, because this DataFrame has 100 more columns that are null for the sport rows.

Drza loren Over a year ago

Not sure I understand. If the age will always be null for a sport and an int for a player, you can do this: df2 = df.fillna(999), then df2 = df2[df2['age'] == 999], and then take df2['Name'] as your sports list.

ketan Over a year ago

@Drzaloren- Please see the Edit in the question. Thanks for the quick hint.
