Python script extract from HTML

Question

I'm writing a script that scans through a set of links. For each link, the script searches a table on the page for a particular row. Once the row is found, the script adds its rank to the variable total_rank, which is the sum of the ranks found across all pages. The rank is equal to the row number.

The code looks like this and is outputting zero:

import requests
from bs4 import BeautifulSoup
import time

url_to_scrape = 'https://www.teamrankings.com/ncb/stats/'
r = requests.get(url_to_scrape)
soup = BeautifulSoup(r.text, "html.parser")

stat_links = []

for a in soup.select(".chooser-list ul"):
    list_entry = a.findAll('li')
    relative_link = list_entry[0].find('a')['href']
    link = "https://www.teamrankings.com" + relative_link
    stat_links.append(link)

total_rank = 0

for link in stat_links:
    r = requests.get(link)
    soup = BeautifulSoup(r.text, "html.parser")

    team_rows = soup.select(".tr-table.datatable.scrollable.dataTable.no-footer table")

    for row in team_rows:
        if row.findAll('td')[1].text.strip() == 'Oklahoma':
            rank = row.findAll('td')[0].text.strip()
            total_rank = total_rank + rank

    # time.sleep(1)

print total_rank

While debugging, I found that team_rows is empty after the select() call. I've also tried different selectors, for example soup.select(".scroll-wrapper div") and soup.select("#DataTables_Table_0_wrapper div"), and all of them return nothing.

I don't think that string = str(a) is what you want. It returns a text representation of an element.
– mic4ael, Commented Jan 7, 2016 at 21:32
@mic4ael Am I wrong that .get takes a string as input? Or is that what you're saying?
– kendall weihe, Commented Jan 7, 2016 at 21:38

audiodude · Accepted Answer · 2016-01-07 23:24:09Z

3

The selector

".tr-table datatable scrollable dataTable no-footer tr"

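selects a <tr> element anywhere under a <no-footer> element, which is itself anywhere under a <dataTable> element, and so on. A space in a CSS selector means "descendant of", and a bare word is read as a tag name.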

I think really "datatable scrollable dataTable no-footer" are classes on your .tr-table? So in that case, they should be joined with the first class with a period. So I believe the final correct selector is:

".tr-table.datatable.scrollable.dataTable.no-footer tr"

UPDATE: the new selector looks like this:

".tr-table.datatable.scrollable.dataTable.no-footer table"

The problem here is that the first part, .tr-table.datatable..., already refers to the table itself, so appending " table" searches for another table nested inside it. Assuming you're trying to get the rows of this table:

<table class="tr-table datatable scrollable dataTable no-footer" id="DataTables_Table_0" role="grid">

The proper selector remains the one I originally suggested.
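
To make the difference between the three selectors discussed above concrete, here is a minimal sketch. The <table> tag is the one quoted in this answer; the two rows inside it are made up for illustration and are not taken from the real page.

```python
from bs4 import BeautifulSoup

# The <table> tag is copied from the answer; the rows are hypothetical.
html = """
<table class="tr-table datatable scrollable dataTable no-footer" id="DataTables_Table_0" role="grid">
  <tr><td>1</td><td>Oklahoma</td></tr>
  <tr><td>2</td><td>Kansas</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# Spaces mean "descendant": this looks for <datatable>, <scrollable>, ... tags.
spaced = soup.select(".tr-table datatable scrollable dataTable no-footer tr")

# Dots chain several classes on ONE element.
base = ".tr-table.datatable.scrollable.dataTable.no-footer"

# " table" asks for a <table> nested inside the matched table.
nested_tables = soup.select(base + " table")

# " tr" asks for the rows inside the matched table.
rows = soup.select(base + " tr")

print(len(spaced), len(nested_tables), len(rows))
```

On this sample markup only the last selector finds anything, because it is the only one that both matches the table by its classes and then descends to its rows.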

edited Jan 7, 2016 at 23:24

answered Jan 7, 2016 at 21:33

audiodude


3 Comments

kendall weihe Over a year ago

I think you're correct, but that didn't fix the underlying problem

kendall weihe Over a year ago

Please check the post. I have updated the code and found a new place where I think the error is.

alecxe Over a year ago

@audiodude good explanation. Posted a more practical-focused answer, check it out.

alecxe · Accepted Answer · 2016-01-08 01:45:03Z

@audiodude's answer is correct, though the suggested selector does not work for me.

You don't need to check every single class of the table element. Here is the working selector:

team_rows = soup.select("table.datatable tr")
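
The thread does not say why the full class list fails. One plausible explanation, stated here purely as an assumption, is that some of those classes (for example dataTable and no-footer) are added in the browser by JavaScript, so they are missing from the HTML that requests downloads. The sketch below uses made-up markup that imitates such a response to show why matching on a single class is more robust.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: an ASSUMED version of the table as served, before
# any JavaScript has added extra classes to it.
served_html = """
<table class="tr-table datatable scrollable">
  <tr><th>Rank</th><th>Team</th></tr>
  <tr><td>1</td><td>Oklahoma</td></tr>
</table>
"""
soup = BeautifulSoup(served_html, "html.parser")

# Requires every listed class to be present, so it finds nothing here.
strict_rows = soup.select(".tr-table.datatable.scrollable.dataTable.no-footer tr")

# Requires only the "datatable" class on a <table> element.
team_rows = soup.select("table.datatable tr")

print(len(strict_rows), len(team_rows))
```

A quick way to check the assumption against the real site would be to print the class attribute of the table found in r.text and compare it with what the browser's inspector shows.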

Also, if you need to find Oklahoma inside the table, you don't have to iterate over every row and cell. Search directly for the specific cell and take its previous sibling, which contains the rank:

rank = soup.find("td", {"data-sort": "Oklahoma"}).find_previous_sibling("td").get_text()
total_rank += int(rank)  # it is important to convert the row number to int
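
As a rough sketch of how the lookup and the int conversion fit together, the snippet below wraps them in a helper that also copes with a page where the team is missing. The helper name rank_of and the sample markup are hypothetical; the data-sort attribute is assumed to be present as in the one-liner above.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the real page is assumed to carry data-sort on the team cell.
html = """
<table class="datatable">
  <tr><td>7</td><td data-sort="Oklahoma">Oklahoma</td></tr>
  <tr><td>8</td><td data-sort="Kansas">Kansas</td></tr>
</table>
"""

def rank_of(soup, team):
    """Return the team's rank as an int, or None if the team is not on the page."""
    cell = soup.find("td", {"data-sort": team})
    if cell is None:
        return None
    rank_cell = cell.find_previous_sibling("td")
    if rank_cell is None:
        return None
    # The cell text is a string such as "7"; convert it before adding.
    return int(rank_cell.get_text(strip=True))

soup = BeautifulSoup(html, "html.parser")

total_rank = 0
for page in [soup, soup]:  # stands in for the pages fetched in the loop
    rank = rank_of(page, "Oklahoma")
    if rank is not None:
        total_rank += rank

print(total_rank)
```

Without the int() call, adding the string "7" to the integer total would raise a TypeError instead of summing.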

Also note that you are extracting more stats links than you should - looks like the Player Stats links should not be followed since you are focused specifically on the Team Stats. Here is one way to get Team Stats links only:

links_list = soup.find("h2", text="Team Stats").find_next_sibling("ul")
stat_links = ["https://www.teamrankings.com" + a["href"] 
              for a in links_list.select("ul.expand-content li a[href]")]
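
The same idea can be tried offline on a small sample. The markup and link paths below are invented to mimic the layout the snippet above assumes (an h2 heading followed by a ul with nested ul.expand-content lists); they are not copied from the site. The sketch uses string=, the newer name for the text= argument in Beautiful Soup 4.4.0 and later, and urljoin instead of string concatenation.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical markup and paths, shaped like the structure assumed above.
html = """
<h2>Team Stats</h2>
<ul>
  <li>Scoring
    <ul class="expand-content">
      <li><a href="/ncb/stat/points-per-game">Points per Game</a></li>
    </ul>
  </li>
</ul>
<h2>Player Stats</h2>
<ul>
  <li>Scoring
    <ul class="expand-content">
      <li><a href="/ncb/player-stat/points">Points</a></li>
    </ul>
  </li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# Take only the list that directly follows the "Team Stats" heading.
links_list = soup.find("h2", string="Team Stats").find_next_sibling("ul")

base_url = "https://www.teamrankings.com"
stat_links = [urljoin(base_url, a["href"])
              for a in links_list.select("ul.expand-content li a[href]")]

print(stat_links)
```

Because the search starts from links_list rather than from soup, the Player Stats links in the second list are never visited.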
