I'm trying to collect a large amount of data on college basketball teams. This page has a TON of team stats: https://www.teamrankings.com/ncb/stats/

I have tried to write a script that scans all the desired links (all Team Stats) from this page, finds the rank of a specified team (an input) on each linked page, then returns the sum of that team's ranks across all links.

Luckily, I found this: https://gist.github.com/phillipsm/404780e419c49a5b62a8

...which is GREAT!

But I must have something wrong, because I'm getting 0.

Here's my code:

import requests
from bs4 import BeautifulSoup
import time

url_to_scrape = 'https://www.teamrankings.com/ncb/stats/'
r = requests.get(url_to_scrape)
soup = BeautifulSoup(r.text, "html.parser")

stat_links = []

for table_row in soup.select(".expand-section li"):

    table_cells = table_row.findAll('li')

    if len(table_cells) > 0:
        link = table_cells[0].find('a')['href']
        stat_links.append(link)

total_rank = 0

for link in stat_links:
    r = requests.get(link)
    soup = BeaultifulSoup(r.text)

    team_rows = soup.select(".tr-table datatable scrollable dataTable no-footer tr")

    for row in team_rows:
        if row.findAll('td')[1].text.strip() == 'Oklahoma':
            rank = row.findAll('td')[0].text.strip()
            total_rank = total_rank + rank

print total_rank

Check out that link to double check I have the correct class specified. I have a feeling the problem might be in the first for loop where I select an li tag then select all li tags within that first tag, I dunno.

I don't use Python so I'm unfamiliar with any debugging tools. So if anyone wants to forward me to one of those that would be great!

  • you can test each step in the python repl Commented Jan 7, 2016 at 4:36

2 Answers

First, the team stats and player stats sections are each contained in a div with class 'column large-2'. The team stats are in the first occurrence. Within it, you can find all of the tags that have an href attribute. I've combined both steps in a one-liner.

teamstats = soup(class_='column large-2')[0].find_all(href=True)

The teamstats list contains all of the 'a' tags. Use a list comprehension to extract the links. A few of the hrefs were just "#" (part of navigation links), so I excluded them.

links = [a['href'] for a in teamstats if a['href'] != '#']
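To try these two lines without hitting the site, here is a minimal sketch against a made-up HTML fragment. The markup in sample_html (including the player-stat path) is an assumption that only imitates the structure described above; the real page may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two divs with the same class attribute,
# team stats first, player stats second.
sample_html = """
<div class="column large-2">
  <ul>
    <li><a href="#">Team Stats</a></li>
    <li><a href="/ncaa-basketball/stat/points-per-game">Points per Game</a></li>
    <li><a href="/ncaa-basketball/stat/floor-percentage">Floor %</a></li>
  </ul>
</div>
<div class="column large-2">
  <ul>
    <li><a href="/ncaa-basketball/player-stat/points">Player Points</a></li>
  </ul>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Calling soup(...) is shorthand for soup.find_all(...).
# A class_ string containing a space is matched against the exact
# value of the class attribute.
teamstats = soup(class_="column large-2")[0].find_all(href=True)

# Keep the real links and drop the "#" navigation placeholder.
links = [a["href"] for a in teamstats if a["href"] != "#"]
print(links)
```

Only the first div is searched, so the player-stat link in the second div is not picked up.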

Here is a sample of the output:

links
Out[84]: 
['/ncaa-basketball/stat/points-per-game',
 '/ncaa-basketball/stat/average-scoring-margin',
 '/ncaa-basketball/stat/offensive-efficiency',
 '/ncaa-basketball/stat/floor-percentage',
 '/ncaa-basketball/stat/1st-half-points-per-game',
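Note that these hrefs are site-relative, so requests.get cannot fetch them as they are. One way to turn them into full URLs is urljoin from the standard library. This is a sketch for Python 3, using the base URL from the question and two of the links shown above:

```python
from urllib.parse import urljoin

# Page the links were scraped from (taken from the question).
base_url = "https://www.teamrankings.com/ncb/stats/"

# Two of the relative links from the sample output.
links = [
    "/ncaa-basketball/stat/points-per-game",
    "/ncaa-basketball/stat/average-scoring-margin",
]

# A leading "/" makes urljoin resolve the path against the site root.
full_links = [urljoin(base_url, href) for href in links]
print(full_links[0])
# prints https://www.teamrankings.com/ncaa-basketball/stat/points-per-game
```

On Python 2, which the print statement in the question suggests, the same function is imported with from urlparse import urljoin.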

I ran your code on my machine, and the line table_cells = table_row.findAll('li') always returns an empty list. As a result stat_links ends up empty, the loop over stat_links never runs, and total_rank is never incremented. I suggest you rework the way you find the list elements.
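Here is a sketch of one possible rework, run against made-up HTML so it works offline. The fragments index_html and stat_html, the helper names collect_links and team_rank, and the rank value are all assumptions for illustration; check the real pages for the actual classes. It also sidesteps three things in the second loop that would fail once the links are found: the misspelled BeaultifulSoup, a selector that treats each class name as a nested tag name, and adding a string rank to an integer.

```python
from bs4 import BeautifulSoup

# Hypothetical fragments that only imitate the pages' structure.
index_html = """
<ul class="expand-section">
  <li><a href="/ncaa-basketball/stat/points-per-game">Points per Game</a></li>
  <li><a href="/ncaa-basketball/stat/floor-percentage">Floor %</a></li>
</ul>
"""
stat_html = """
<table class="tr-table datatable scrollable">
  <tr><th>Rank</th><th>Team</th></tr>
  <tr><td>1</td><td>Example U</td></tr>
  <tr><td>7</td><td>Oklahoma</td></tr>
</table>
"""


def collect_links(html):
    """Return the href of the first link inside each selected <li>."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for item in soup.select(".expand-section li"):
        # item.find_all("li") would be empty here: find_all searches only
        # descendants, and these <li> elements contain no nested <li>.
        anchor = item.find("a", href=True)
        if anchor is not None:
            links.append(anchor["href"])
    return links


def team_rank(html, team):
    """Return the team's rank as an int, or None if the team is absent."""
    soup = BeautifulSoup(html, "html.parser")
    # "table.tr-table tr" selects rows inside a table with class tr-table.
    for row in soup.select("table.tr-table tr"):
        cells = row.find_all("td")
        if len(cells) > 1 and cells[1].text.strip() == team:
            return int(cells[0].text.strip())  # cell text is a string
    return None


print(collect_links(index_html))
print(team_rank(stat_html, "Oklahoma"))
```

With real pages you would fetch each link with requests.get, pass r.text to team_rank, and add up the results that are not None.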
