Python script extract data from HTML page

Question

I'm trying to do a massive data accumulation on college basketball teams. This link: https://www.teamrankings.com/ncb/stats/ has a TON of team stats.

I have tried to write a script that scans all the desired links (all Team Stats) from this page, finds the rank of the specified team (an input), then returns the sum of that team's ranks from all links.

I was fortunate to find this: https://gist.github.com/phillipsm/404780e419c49a5b62a8

...which is GREAT!

But I must have something wrong because I'm getting 0

Here's my code:

import requests
from bs4 import BeautifulSoup
import time

url_to_scrape = 'https://www.teamrankings.com/ncb/stats/'
r = requests.get(url_to_scrape)
soup = BeautifulSoup(r.text, "html.parser")

stat_links = []

for table_row in soup.select(".expand-section li"):

    table_cells = table_row.findAll('li')

    if len(table_cells) > 0:
        link = table_cells[0].find('a')['href']
        stat_links.append(link)

total_rank = 0

for link in stat_links:
    r = requests.get(link)
    soup = BeaultifulSoup(r.text)

    team_rows = soup.select(".tr-table datatable scrollable dataTable no-footer tr")

    for row in team_rows:
        if row.findAll('td')[1].text.strip() == 'Oklahoma':
            rank = row.findAll('td')[0].text.strip()
            total_rank = total_rank + rank

print total_rank

Check out that link to double check I have the correct class specified. I have a feeling the problem might be in the first for loop where I select an li tag then select all li tags within that first tag, I dunno.

I don't use Python so I'm unfamiliar with any debugging tools. So if anyone wants to forward me to one of those that would be great!

you can test each step in the python repl

– Fabricator, Commented Jan 7, 2016 at 4:36

floydn · Accepted Answer · 2016-01-07 08:31:22Z

1

First, the team stats and player stats sections are each contained in a <div class='column large-2'>. The team stats are in the first occurrence. Then you can find all of the tags with an href attribute within it. I've combined both steps in a one-liner.

teamstats = soup(class_='column large-2')[0].find_all(href=True)
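As a small illustration of that one-liner, the sketch below runs it against a made-up HTML fragment (the markup and the player-stat path are assumptions, not the real page). It also shows a caveat: a multi-word string passed to class_ is compared against the class attribute's exact string, so the order of the class names matters, whereas a CSS selector such as .large-2.column does not depend on order.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real stats page.
html = """
<div class="column large-2">
  <ul>
    <li><a href="#">Team Stats</a></li>
    <li><a href="/ncaa-basketball/stat/points-per-game">Points per Game</a></li>
  </ul>
</div>
<div class="column large-2">
  <ul>
    <li><a href="/ncaa-basketball/player-stat/points">Points</a></li>
  </ul>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Calling soup(...) is shorthand for soup.find_all(...).
first_section = soup(class_="column large-2")[0]
teamstats = first_section.find_all(href=True)
hrefs = [a["href"] for a in teamstats]

# Same class names in a different order: the exact-string match finds nothing,
# while the CSS selector still matches both divs.
reversed_order = soup(class_="large-2 column")
css_matches = soup.select(".large-2.column")
```

If the class order on the live page ever differs from the string in the one-liner, soup.select(".column.large-2") may be the more forgiving choice.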

The teamstats list contains all of the 'a' tags. Use a list comprehension to extract the links. A few of the hrefs were just "#" (part of the navigation links), so I excluded them.

links = [a['href'] for a in teamstats if a['href'] != '#']

Here is a sample of output:

links
Out[84]: 
['/ncaa-basketball/stat/points-per-game',
 '/ncaa-basketball/stat/average-scoring-margin',
 '/ncaa-basketball/stat/offensive-efficiency',
 '/ncaa-basketball/stat/floor-percentage',
 '/ncaa-basketball/stat/1st-half-points-per-game',
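The hrefs in this sample are site-relative, so they would need to be resolved against the site's address before being passed to requests.get. A minimal sketch using the standard library's urljoin (Python 3 import path), assuming the base URL from the question and using only hrefs shown in the sample:

```python
from urllib.parse import urljoin

# Base URL taken from the question.
base_url = "https://www.teamrankings.com/ncb/stats/"

# hrefs as they might come back from the page, including a "#" navigation link.
hrefs = [
    "#",
    "/ncaa-basketball/stat/points-per-game",
    "/ncaa-basketball/stat/floor-percentage",
]

# Skip the "#" placeholders and resolve the rest against the base URL.
links = [urljoin(base_url, href) for href in hrefs if href != "#"]

for link in links:
    print(link)
```

Because each href starts with "/", urljoin replaces the whole path of the base URL rather than appending to it.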


Michael Sova · Accepted Answer · 2016-01-07 04:58:35Z

0

I ran your code on my machine, and the line table_cells = table_row.findAll('li') always returns an empty list. As a result, stat_links ends up empty, the loop over stat_links never runs, and total_rank is never incremented. I suggest you experiment with the way you find the list elements.
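To illustrate with a made-up fragment (the markup below is an assumption, not the real page): soup.select(".expand-section li") already returns the li elements themselves, so searching each one for further nested li tags finds nothing. Reading the a tag from each li directly is one possible fix.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the real page's structure may differ.
html = """
<ul class="expand-section">
  <li><a href="/ncaa-basketball/stat/points-per-game">Points per Game</a></li>
  <li><a href="/ncaa-basketball/stat/floor-percentage">Floor Percentage</a></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# These are already the <li> elements.
items = soup.select(".expand-section li")

# Looking for <li> tags inside each <li> comes back empty every time.
nested = [item.find_all("li") for item in items]

# Take the link straight from each <li> instead.
stat_links = []
for item in items:
    anchor = item.find("a", href=True)
    if anchor is not None:
        stat_links.append(anchor["href"])

print(stat_links)
```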
