I am new to Python pandas. I have a DataFrame like the one below:

import pandas as pd

df = pd.DataFrame({'Name': ['football', 'ramesh','suresh','pankaj','cricket','rakesh','mohit','mahesh'],
               'age': ['25', '22','21','32','37','26','24','30']})
print(df)

       Name age
0  football  25
1    ramesh  22
2    suresh  21
3    pankaj  32
4   cricket  37
5    rakesh  26
6     mohit  24
7    mahesh  30

The "Name" column contains both sport names and the names of sportspeople. I want to split it into two separate columns, like below:

Expected Output:

sports_name sport_person_name age
football    ramesh            22
            suresh            21
            pankaj            32
cricket     rakesh            26
            mohit             24
            mahesh            30

If I group by the "Name" column I don't get the expected output, which is unsurprising because the column has no duplicates. What should I use to get the expected output?

Edit: If you don't want to hardcode the sport names:

import numpy as np
import pandas as pd

df = pd.DataFrame({'Name': ['football', 'ramesh','suresh','pankaj','cricket','rakesh','mohit','mahesh'],
           'age': ['', '22','21','32','','26','24','30']})

#treat empty strings as missing values (exact match, so no regex is needed)
df = df.replace('', np.nan)

#rows with a missing value are the sport rows
nan_rows = df[df.isnull().any(axis=1)]
sports = nan_rows['Name'].tolist()

df['sports_name'] = df['Name'].where(df['Name'].isin(sports)).ffill()
d = {'Name':'sport_person_name'}
df = df[df['sports_name'] != df['Name']].reset_index(drop=True).rename(columns=d)
df = df[['sports_name','sport_person_name','age']]
print (df)

I checked which rows contain NaN in every column except "Name"; those rows are definitely the sport names. I built a list of those sport names and used the solutions below to create the sports_name and sport_person_name columns.
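
The check described above can be sketched as follows. This is only a minimal sketch, assuming a sport row is empty in every column other than "Name"; the helper name split_sports and its name_col parameter are hypothetical and not part of the original code.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Name': ['football', 'ramesh', 'suresh', 'pankaj',
                            'cricket', 'rakesh', 'mohit', 'mahesh'],
                   'age': ['', '22', '21', '32', '', '26', '24', '30']})

def split_sports(frame, name_col='Name'):
    # Hypothetical helper: a row counts as a sport row when every
    # column other than name_col is empty or NaN.
    data = frame.replace('', np.nan)
    is_sport = data.drop(columns=name_col).isna().all(axis=1)
    out = data.rename(columns={name_col: 'sport_person_name'})
    # Keep the name only on sport rows, then carry it down to the players
    sport_col = out['sport_person_name'].where(is_sport).ffill()
    out.insert(0, 'sports_name', sport_col)
    # Drop the sport rows themselves, leaving one row per player
    return out[~is_sport].reset_index(drop=True)

result = split_sports(df)
print(result)
```

Because the boolean mask is used directly instead of going through a list and isin, this works for any number of extra columns, and a player whose name happens to equal a sport name would not be mistaken for a sport row.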

2 Answers

You can use:

#define the list of sports
sports = ['football','cricket']
#keep Name only where it is a sport (NaN elsewhere), then forward fill the NaNs
df['sports_name'] = df['Name'].where(df['Name'].isin(sports)).ffill()
#remove rows where sports_name equals Name (the sport rows), rename the column
d = {'Name':'sport_person_name'}
df = df[df['sports_name'] != df['Name']].reset_index(drop=True).rename(columns=d)
#change the order of the columns
df = df[['sports_name','sport_person_name','age']]
print (df)
  sports_name sport_person_name age
0    football            ramesh  22
1    football            suresh  21
2    football            pankaj  32
3     cricket            rakesh  26
4     cricket             mohit  24
5     cricket            mahesh  30
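
To see why this works, it can help to look at the intermediate Series. The following is a small sketch on the question's data; the variable names marked and filled are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['football', 'ramesh', 'suresh', 'pankaj',
                            'cricket', 'rakesh', 'mohit', 'mahesh'],
                   'age': ['25', '22', '21', '32', '37', '26', '24', '30']})
sports = ['football', 'cricket']

# Step 1: keep the value only where it is a sport, NaN everywhere else
# -> football, NaN, NaN, NaN, cricket, NaN, NaN, NaN
marked = df['Name'].where(df['Name'].isin(sports))

# Step 2: forward fill, so each player row inherits the sport above it
# -> football x4, cricket x4
filled = marked.ffill()

# Show the three stages side by side
steps = pd.concat([df['Name'], marked.rename('marked'), filled.rename('filled')], axis=1)
print(steps)
```

After the fill, the sport rows are the only rows where the filled value equals Name, which is why comparing the two columns removes exactly those rows.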

A similar solution uses DataFrame.insert, so reordering the columns is not necessary:

#define the list of sports
sports = ['football','cricket']
#rename the column with a dict
d = {'Name':'sport_person_name'}
df = df.rename(columns=d)
#keep the name only where it is a sport (NaN elsewhere), forward fill, insert as first column
df.insert(0, 'sports_name', df['sport_person_name'].where(df['sport_person_name'].isin(sports)).ffill())
#remove rows where sports_name equals sport_person_name (the sport rows)
df = df[df['sports_name'] != df['sport_person_name']].reset_index(drop=True)
print (df)
  sports_name sport_person_name age
0    football            ramesh  22
1    football            suresh  21
2    football            pankaj  32
3     cricket            rakesh  26
4     cricket             mohit  24
5     cricket            mahesh  30

If you want the sport shown only on the first row of each group, add limit=1 to ffill and replace the remaining NaNs with empty strings:

sports = ['football','cricket']
df['sports_name'] = df['Name'].where(df['Name'].isin(sports)).ffill(limit=1).fillna('')
d = {'Name':'sport_person_name'}
df = df[df['sports_name'] != df['Name']].reset_index(drop=True).rename(columns=d)
df = df[['sports_name','sport_person_name','age']]
print (df)
  sports_name sport_person_name age
0    football            ramesh  22
1                        suresh  21
2                        pankaj  32
3     cricket            rakesh  26
4                         mohit  24
5                        mahesh  30

8 Comments

@jezrael Thanks for your answer. What if I don't want to hardcode the sport names in the code? If they change dynamically, what options do we have?
Hmm, that is problematic. What is the logic for telling sports apart from names?
Yes, see my actual question. It's a pivot table. Give me a hint if you have any idea about reading pivot tables in pandas. stackoverflow.com/questions/46154843/…?
Maybe one possible solution is to create a big list of all existing sports from Wikipedia, but I'm not sure it would match all your data.
@jezrael No, that is not productive. It will not work for me.

The output you want is really a dictionary, not a dataframe. The dictionary would look like this:

{'Sport' : {'Player' : age,'Player2' : age}}
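
A dictionary of that shape could be built along these lines. This is only a sketch, assuming the sport names are known in advance and that the first row of the column is a sport; the variable name nested is illustrative.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['football', 'ramesh', 'suresh', 'pankaj',
                            'cricket', 'rakesh', 'mohit', 'mahesh'],
                   'age': ['25', '22', '21', '32', '37', '26', '24', '30']})
sports = ['football', 'cricket']

nested = {}
current_sport = None
for name, age in zip(df['Name'], df['age']):
    if name in sports:
        # A sport row starts a new inner dictionary
        current_sport = name
        nested[current_sport] = {}
    else:
        # A player row is stored under the most recent sport
        # (assumes a sport row has already been seen)
        nested[current_sport][name] = age

print(nested)
```

The ages stay strings here because they are strings in the question's data.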

If you really want a dataframe, and the sport name always comes before its players:

import pandas as pd

df = pd.DataFrame({'Name': ['football','ramesh','suresh','pankaj','cricket',
                            'rakesh','mohit','mahesh'],
                   'age': ['25', '22','21','32','37','26','24','30']})

sports = ['football', 'cricket']
wanted_dict = {}
current_sport = ''

# Walk down the column: a sport name starts a new group,
# every other value is a player belonging to the current sport
for val in df['Name']:
    if val in sports:
        current_sport = val
    else:
        wanted_dict[val] = current_sport

# Now you have {name: sport_name, ...}

# Look up each player's sport; sport rows are not in the dict and become NaN
df['sports_name'] = df['Name'].map(wanted_dict)

# Drop the sport rows, rename the column and reorder
df = df.dropna(subset=['sports_name']).rename(columns={'Name': 'sport_person_name'})
df = df[['sports_name', 'sport_person_name', 'age']].reset_index(drop=True)

What it should look like:

sports_name sport_person_name age
football    ramesh            22
football    suresh            21
football    pankaj            32
cricket     rakesh            26
cricket     mohit             24
cricket     mahesh            30

5 Comments

@Drzaloren Thanks for your answer. What if I don't want to hardcode the sport names in the code? If they change dynamically, what options do we have?
Well, if each sport is followed by the same number of people, maybe you can use the index to create the sports list. I see you have an age column, so what will be written there for a sport? If the value for a sport is NaN, you can try using that.
@Drzaloren Yes, you are correct. See the correct dataframe: df = pd.DataFrame({'Name': ['football', 'ramesh','suresh','pankaj','cricket','rakesh','mohit','mahesh'], 'age': ['', '22','21','32','','26','24','30']}). But don't hardcode the condition that the age column is null, because this dataframe has 100 more columns which are null for the sport rows.
Not sure I understand. If the age will always be null for a sport and an int for a player, you can do this: df2 = df.fillna(999), then df2 = df2[df2['age'] == 999], and then take df2['Name'] as your sport list.
@Drzaloren Please see the edit in the question. Thanks for the quick hint.
