0

I'm building an automated MLB schedule from a base URL and a loop through a list of team names as they appear in the URL. Using pd.read_html I get each team's schedule. The only thing I'm missing is, for each team's page, the team name itself, which I'd like as a new column 'team_name'. I have a small sample of my goal at the end of this post.

Below, is what I have so far, and if you run this, the print out does exactly what I need for just one team.

import pandas as pd
url_base = "https://www.teamrankings.com/mlb/team/"
team_list = ['seattle-mariners']

df = pd.DataFrame()
for team in (team_list):
    new_url = url_base + team
    df = df.append(pd.read_html(new_url)[1])
    df['team_name'] = team

print(df[['team_name', 'Opponent']]) 

The trouble is, when I have all 30 teams in team_list, the value of team_name keeps getting overwritten, so that all 4000+ records list the same team name (the last one in team_list). I've tried dynamically assigning only certain rows the team value by using

df['team_name'][a:b] = team

where a, b are the starting and ending rows on the dataframe for the index team; but this gives KeyError: 'team_name'. I've also tried using placeholder series and dataframes for team_name, then merging with df later, but get duplication errors. On a larger scale, what I'm looking for is this:

           team_name          opponent
0   seattle-mariners  new-york-yankees
1   seattle-mariners  new-york-yankees
2   seattle-mariners    boston-red-sox
3   seattle-mariners    boston-red-sox
4   seattle-mariners  san-diego-padres
5   seattle-mariners  san-diego-padres
6    cincinatti-reds  new-york-yankees
7    cincinatti-reds  new-york-yankees
8    cincinatti-reds    boston-red-sox
9    cincinatti-reds    boston-red-sox
10   cincinatti-reds  san-diego-padres
11   cincinatti-reds  san-diego-padres

1 Answer 1

4

The original code df['team_name'] = team rewrites team_name for the entire df. The code below creates a placeholder, df_team, where team_name is updated and then df.append(df_team).

url_base = "https://www.teamrankings.com/mlb/team/"
team_list = ['seattle-mariners', 'houston-astros']

Option A: for loop

df_list = list()

for team in (team_list):
    new_url = url_base + team
    df_team = pd.read_html(new_url)[1]
    df_team['team_name'] = team
    df_list.append(df_team)

df = pd.concat(df_list)

Option B: list comprehension:

df_list = [pd.read_html(url_base + team)[1].assign(team=team) for team in team_list]

df = pd.concat(df_list)

df.head()

enter image description here

df.tail()

enter image description here

Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.