Selecting specific values in a DataFrame after using the replace method

Question

Here's my code:

from bs4 import BeautifulSoup
from urllib.request import urlopen
import pandas as pd
import matplotlib as mpl
import numpy as np
import matplotlib.pyplot as plt

# Create a list of the years to extract data for

years_list = [2001, 2002, 2008, 2012, 2015,2018, 2020 , 2021]
player_list = ['Mac Jones', 'Aaron Rodgers', 'Deshaun Watson', 'Patrick Mahomes',
                'Josh Allen', 'Ryan Tannehill', 'Drew Bress', 'Russel Wilson',
                'Kirk Cousins', 'Tom Brady', 'Derek Carr']

#selecting stats
cols = ['Player', 'Tm','Cmp%', 'Yds', 'TD', 'Int', 'Y/A', 'Rate', 'G']
df_list = []

# Loop over the years and extract the data
for year in years_list:
    url_mac = f'https://www.pro-football-reference.com/years/{year}/passing.htm'
    temp_df = pd.read_html(url_mac)[0][cols]
    temp_df['Season'] = year
    
    df_list.append(temp_df)
    print(f'Collected: {year}')


data_radar = pd.concat(df_list)

#renaming columns
new_columns = data_radar.columns.values
new_columns[-6] = 'y_sack'
data_radar.columns = new_columns

#picking stats
mid_data = pd.DataFrame()
for player in player_list:
    mid_data = mid_data.append(data_radar[data_radar['Player'] == player + '*'])
    mid_data = mid_data.append(data_radar[data_radar['Player'] == player + '*' + '+'])
    mid_data = mid_data.append(data_radar[data_radar['Player'] == player])
    mid_data = mid_data.append(data_radar[data_radar['Player'] == player + '+'])

#relevant stats
cols = ['Cmp%', 'Yds', 'Int', 'Y/A','Rate', 'G', 'Season']
final_data = pd.DataFrame()

#fixing names
mid_data = mid_data.replace({'Tom Brady*':'Tom Brady', 'Aaron Rodgers*':'Aaron Rodgers','Aaron Rodgers*+':'Aaron Rodgers',
                   'Deshaun Watson*':'Deshaun Watson', 'Josh Allen*':'Josh Allen',
                   'Derek Carr*':'Derek Carr','Patrick Mahomes*':'Patrick Mahomes', 'Patrick Mahomes*+':'Patrick Mahomes' })




# Select the player information columns and order the rows

final_data = mid_data[['Player', 'Tm'] + cols]
final_data.sort_values(by = 'Player', ascending=True)
final_data.drop_duplicates(subset = 'Player')

What I want is for final_data to return the first season of each player, but that doesn't work for the players whose names I had to fix with the replace method.

The result I'm describing is what I get after calling sort_values, before drop_duplicates().

My idea was to sort these values and then use drop_duplicates() to keep only the first row for each player.

This happens with every player whose name I had to fix with the replace method. How can I fix this?

If you find that a solution worked for you, you should make sure to accept it. This includes your previous posts for which you have not yet accepted a solution either.
– chitown88, Commented Sep 29, 2021 at 12:09
Oh! Thank you. I cast my positive vote, but I didn't see that there is an accept button.
– GLVieira, Commented Sep 29, 2021 at 15:33

chitown88 · Accepted Answer · 2021-09-29 12:07:35Z

1

There are quite a few confusing parts in your code. First, if all you are trying to do is get rid of the '*' and/or '+' in the player names, why not just do that instead of hard-coding each player? Second, your comments don't actually describe what your code is doing. I don't see the point of

#Converting colums from object to floats
cols = ['Cmp%', 'Yds', 'Int', 'Y/A','Rate', 'G', 'Season']
final_data = pd.DataFrame()

as you are not converting to floats, and the comment # picking top 10 qb in rating stats in last season + Mac Jones doesn't describe what the code does either. The comments make the code very confusing to follow.

Third, if you want the first season of each player, you need to sort by 'Season' as well, so that when you drop duplicates of the player name you can explicitly keep the first row for that player, which will be their first season in the dataframe once it is sorted.
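
As a minimal sketch of the sort-then-deduplicate step described above, here is the same idea on a small made-up frame. The name toy and its values are assumptions for illustration only, not scraped data:

```python
import pandas as pd

# Made-up rows for illustration only; not real scraped statistics.
toy = pd.DataFrame({
    "Player": ["Tom Brady", "Aaron Rodgers", "Tom Brady", "Aaron Rodgers"],
    "Season": [2015, 2012, 2001, 2008],
})

# Sort by player, then by season, so each player's earliest row comes first.
toy_sorted = toy.sort_values(by=["Player", "Season"])

# keep="first" retains the first row per player, i.e. the earliest season.
first_seasons = toy_sorted.drop_duplicates(subset="Player", keep="first")
print(first_seasons)
```

Sorting by 'Player' alone would group the rows but leave the seasons within each player in their original order, so the row kept by drop_duplicates would not necessarily be the earliest one.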

Try this:

import pandas as pd


# Create a list of the years to extract data for

years_list = [2001, 2002, 2008, 2012, 2015,2018, 2020 , 2021]
player_list = ['Mac Jones', 'Aaron Rodgers', 'Deshaun Watson', 'Patrick Mahomes',
                'Josh Allen', 'Ryan Tannehill', 'Drew Bress', 'Russel Wilson',
                'Kirk Cousins', 'Tom Brady', 'Derek Carr']

#selecting stats
cols = ['Player', 'Tm','Cmp%', 'Yds', 'TD', 'Int', 'Y/A', 'Rate', 'G']
df_list = []

# Loop over the years and extract the data
for year in years_list:
    url_mac = f'https://www.pro-football-reference.com/years/{year}/passing.htm'
    temp_df = pd.read_html(url_mac)[0][cols]
    temp_df['Season'] = year
    
    temp_df = temp_df[temp_df['Player'] != 'Player']
    
    df_list.append(temp_df)
    print(f'Collected: {year}')
data_radar = pd.concat(df_list)


#renaming columns
new_columns = data_radar.columns.values
new_columns[-6] = 'y_sack'
data_radar.columns = new_columns

# Replace * or + with '' (regex=True is required on pandas 2.0+, where the default is a literal match)
data_radar['Player'] = data_radar['Player'].str.replace(r'\*|\+', '', regex=True)


cols = ['Cmp%', 'Yds', 'Int', 'Y/A','Rate', 'G', 'Season']

# Select the player information columns and order the rows
final_data = data_radar[['Player', 'Tm'] + cols]
final_data = final_data.sort_values(by = ['Player', 'Season'], ascending=[True,True])
final_data = final_data.drop_duplicates(subset = 'Player', keep='first')

Output:

print(final_data)
                Player   Tm  Cmp%   Yds Int   Y/A   Rate   G  Season
53         A.J. Feeley  PHI  71.4   143   1  10.2  114.0   1    2001
41       A.J. McCarron  CIN  66.4   854   2   7.2   97.1   7    2015
3         Aaron Brooks  NOR  55.9  3832  22   6.9   76.4  16    2001
3        Aaron Rodgers  GNB  63.6  4038  13   7.5   93.8  16    2008
71         Akili Smith  CIN  62.5    37   0   4.6   73.4   2    2001
..                 ...  ...   ...   ...  ..   ...    ...  ..     ...
89       Wayne Chrebet  NYJ   0.0     0   0   0.0   39.6  15    2002
39   Zach Mettenberger  TEN  60.8   935   7   5.6   66.7   7    2015
112        Zach Pascal  IND   0.0     0   0   0.0   39.6  16    2020
27         Zach Wilson  NYJ  55.2   628   7   6.0   51.6   3    2021
105          Zay Jones  BUF   0.0     0   0   0.0   39.6  16    2018

[427 rows x 9 columns]
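
The output above contains every player on the page, whereas the question only cared about the names in player_list. One possible way to narrow it down, sketched here on made-up rows, is to clean the names first and then filter with isin. The names toy and wanted are hypothetical stand-ins for data_radar and player_list:

```python
import pandas as pd

# Made-up rows for illustration only; the suffixes mimic the scraped names.
toy = pd.DataFrame({
    "Player": ["Tom Brady*", "Aaron Rodgers*+", "Zach Wilson", "Tom Brady+"],
    "Season": [2001, 2008, 2021, 2015],
})
wanted = ["Tom Brady", "Aaron Rodgers"]  # hypothetical subset of player_list

# Strip the '*' and '+' markers; regex=True is needed on pandas 2.0 and later.
toy["Player"] = toy["Player"].str.replace(r"[*+]", "", regex=True)

# Keep only the players of interest with a single boolean mask.
selected = toy[toy["Player"].isin(wanted)]
print(selected)
```

A single mask like this also avoids the repeated DataFrame.append calls from the question, which matters on newer installations because DataFrame.append was removed in pandas 2.0.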


5 Comments

GLVieira Over a year ago

Thanks, the result was not perfect, but it gave me everything I needed to go on. Those comments were wrong; I have already fixed them.

GLVieira Over a year ago

About the comment you singled out: it was clear to me because I already knew what the code was doing. The players in player_list finished in the top 10 in the rating statistic last season, and Mac Jones is the rookie I want to compare them with. I'll work on improving this part of my skills, thanks.

GLVieira Over a year ago

Do you have any idea why my sort_values was not working?

chitown88 Over a year ago

Yes. When you sorted, the sorted dataframe wasn't being stored, just output, so you need to assign it. Change final_data.sort_values(by = 'Player', ascending=True) to final_data = final_data.sort_values(by = 'Player', ascending=True).

chitown88 Over a year ago

You could keep it the way you had it, but then you need to add the inplace=True parameter: final_data.sort_values(by = 'Player', ascending=True, inplace=True). That would work as well.
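
The difference between the two options in these comments can be sketched on a small made-up frame. The name toy and its values are assumptions for illustration only:

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({"Player": ["Tom Brady", "Aaron Rodgers"], "Season": [2001, 2008]})

# The return value is discarded here, so toy keeps its original row order.
toy.sort_values(by="Player")
unchanged_order = toy["Player"].tolist()

# Option 1: assign the sorted copy to a name.
sorted_copy = toy.sort_values(by="Player")

# Option 2: sort the existing object in place (the call itself returns None).
toy.sort_values(by="Player", inplace=True)
print(toy)
```

The same applies to drop_duplicates: it also returns a new DataFrame by default, so its result has to be assigned (or inplace=True passed) for the change to persist.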
