4

Below is a scraper that loops through two websites, scrapes a team's roster information, puts the information into an array, and exports the arrays into a CSV file. Everything works great, but the only problem is the writerow headers repeat in the csv file every time the scraper moves on to the second website. Is it possible to adjust the CSV portion of the code to have the headers only appear once when the scraper is looping through multiple websites? Thanks in advance!

import requests
import csv
from bs4 import BeautifulSoup

team_list={'yankees','redsox'}

for team in team_list:
    page = requests.get('http://m.{}.mlb.com/roster/'.format(team))
    soup = BeautifulSoup(page.text, 'html.parser')

    soup.find(class_='nav-tabset-container').decompose()
    soup.find(class_='column secondary span-5 right').decompose()

    roster = soup.find(class_='layout layout-roster')
    names = [n.contents[0] for n in roster.find_all('a')]
    ids = [n['href'].split('/')[2] for n in roster.find_all('a')]
    number = [n.contents[0] for n in roster.find_all('td', index='0')]
    handedness = [n.contents[0] for n in roster.find_all('td', index='3')]
    height = [n.contents[0] for n in roster.find_all('td', index='4')]
    weight = [n.contents[0] for n in roster.find_all('td', index='5')]
    DOB = [n.contents[0] for n in roster.find_all('td', index='6')]
    team = [soup.find('meta',property='og:site_name')['content']] * len(names)

    with open('MLB_Active_Roster.csv', 'a', newline='') as fp:
        f = csv.writer(fp)
        f.writerow(['Name','ID','Number','Hand','Height','Weight','DOB','Team'])
        f.writerows(zip(names, ids, number, handedness, height, weight, DOB, team))
2
  • Have you tried moving f.writerow above for team in team_list? Commented Jul 15, 2018 at 23:30
  • Just write the header before the for loop. This means that the for loop should be wrapped within the with context manager. Commented Jul 16, 2018 at 0:12

3 Answers 3

3

Using a variable to check if header is added or not may be helpful. If header added it will not add second times

header_added = False
for team in team_list:
    do_some stuff

    with open('MLB_Active_Roster.csv', 'a', newline='') as fp:
        f = csv.writer(fp)
        if not header_added:
            f.writerow(['Name','ID','Number','Hand','Height','Weight','DOB','Team'])
            header_added = True
        f.writerows(zip(names, ids, number, handedness, height, weight, DOB, team))
Sign up to request clarification or add additional context in comments.

Comments

2

Another method would be to simply do it before the for loop so you do not have to check if already written.

import requests
import csv
from bs4 import BeautifulSoup

team_list={'yankees','redsox'}

with open('MLB_Active_Roster.csv', 'w', newline='') as fp:
    f = csv.writer(fp)
    f.writerow(['Name','ID','Number','Hand','Height','Weight','DOB','Team'])

for team in team_list:
    do_your_bs4_and_parsing_stuff

    with open('MLB_Active_Roster.csv', 'a', newline='') as fp:
        f = csv.writer(fp)
        f.writerows(zip(names, ids, number, handedness, height, weight, DOB, team))

You can also open the document just once instead of three times as well

import requests
import csv
from bs4 import BeautifulSoup

team_list={'yankees','redsox'}

with open('MLB_Active_Roster.csv', 'w', newline='') as fp:
    f = csv.writer(fp)
    f.writerow(['Name','ID','Number','Hand','Height','Weight','DOB','Team'])

    for team in team_list:
        do_your_bs4_and_parsing_stuff

        f.writerows(zip(names, ids, number, handedness, height, weight, DOB, team))

Comments

1

Just write the header before the loop, and have the loop within the with context manager:

import requests
import csv
from bs4 import BeautifulSoup

team_list = {'yankees', 'redsox'}

headers = ['Name', 'ID', 'Number', 'Hand', 'Height', 'Weight', 'DOB', 'Team']

# 1. wrap everything in context manager
with open('MLB_Active_Roster.csv', 'a', newline='') as fp:
    f = csv.writer(fp)

    # 2. write headers before anything else
    f.writerow(headers)

    # 3. now process the loop
    for team in team_list:
        # Do everything else...

You could also define your headers similarily to team_list outside the loop, which leads to cleaner code.

3 Comments

Thanks for the advice RoadRunner! The code ran but unfortunately, it returned an empty CSV file. Do I have to include the writerow zip line at the end of the for loop?
@NateWalker Yeah the writerow should be at the end of the loop.
@NateWalker No worries.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.