Getting Text from Within Nested Elements in HTML Text Using BeautifulSoup in Python

Question

I'm trying to extract the teams playing each day and the active and inactive players in each team's lineup. The page I'm trying to scrape is https://stats.nba.com/lineups/. I've been using BeautifulSoup and have tried a few approaches, but I can't seem to extract anything within the

<div class="landing__flex-col lineups-game" data-game-state="3" nba-data-game="game" nba-with ng-include ng-repeat="game in games" src="'/lineups-template.html'">.

I want to get the teams in each matchup within each

<div class="landing__flex-col lineups-game" data-game-state="3" nba-data-game="game" nba-with ng-include ng-repeat="game in games" src="'/lineups-template.html'">,

and each player within the

<div class="columns small-6 lineups-game__team lineups-game__team--htm" nba-with nba-with-data-team="game.h" ng-include src="'/lineups-team-template.html'">.

So within the sample of html code below, I want to get the text for MEM, CHA, J. Valanciunas, and J. Crowder, and eventually do this for each player for each team.

<div class="landing__flex-row lineups-games" ng-show="isLoaded &amp;&amp; hasData" aria-hidden="false">
          <!----><!----><div class="landing__flex-col lineups-game" ng-repeat="game in games" nba-with="" nba-data-game="game" data-game-state="3" ng-include="" src="'/lineups-template.html'">
  <div class="lineups-game__inner row">

    <div class="columns small-12 lineups-game__title">
      <a href="/game/0021900154/">
        <span class="lineups-game__team-name">MEM</span>
        <span class="lineups-game__vs">vs</span>
        <span class="lineups-game__team-name">CHA</span>
        <span class="lineups-game__status hide-for-live-game">Final</span>
        <span class="lineups-game__status hide-for-pre-game hide-for-post-game">Live</span>
      </a>
    </div>

    <!----><div class="columns small-6 lineups-game__team lineups-game__team--vtm" nba-with="" nba-with-data-team="game.v" ng-include="" src="'/lineups-team-template.html'">

  <!----><!----><div ng-if="team.hasBench" nba-with="" nba-with-data-team="team" ng-include="" src="'/lineups-confirmed-roster-template.html'">
  <div class="lineups-game__header">
    <img team-logo="" class="lineups-game__team-logo team-img" abbr="MEM" type="image/svg+xml" src="/media/img/teams/logos/MEM_logo.svg" alt="Memphis Grizzlies logo" title="Memphis Grizzlies logo">
    <span class="lineups-game__team-name">MEM</span>
  </div>

  <div class="lineups-game__roster-type lineups-game__roster-type--confirmed">Active List</div>

  <ul class="lineups-game__roster lineups-game__roster--official">
    <!----><li class="lineups-game__player lineups-game__player--starter" ng-repeat="pl in team.starters">
      <a href="/player/202685/">
        <span class="lineups-game__pos">C</span>
        <span class="lineups-game__name">J. Valanciunas</span>
      </a>
    </li><!----><li class="lineups-game__player lineups-game__player--starter" ng-repeat="pl in team.starters">
      <a href="/player/203109/">
        <span class="lineups-game__pos">SF</span>
        <span class="lineups-game__name">J. Crowder</span>
      </a>

I tried by doing the following, among other methods, to no avail:

import urllib.request
import bs4 as bs

gamesSource = urllib.request.urlopen('https://stats.nba.com/lineups/').read()
gamesSoup = bs.BeautifulSoup(gamesSource,'html.parser')

teams = gamesSoup.find_all("span",{"class":"lineups-game__teams-name"})

All that ever gets returned is an empty list, and when I try to get a specific 'span' line, all that gets returned is 'None'.

Let me know what's going wrong, and what I can do to access the information I'm trying to get.

Thanks.


What do you want as output? It's actually all right there with an API call. – chitown88, Commented Nov 18, 2019 at 23:08

A_Patterson · Accepted Answer · 2019-11-15 20:23:39Z

2

Piggybacking on what has already been stated: since this page is generated via API/JS calls, you will need a different scraping tool. I usually reach for Selenium. The code below pulls all the teams and rosters and pairs them up. There may be some quirks in this code, but I think it will get you headed in the right direction:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.firefox.service import Service

desired_link = 'https://stats.nba.com/lineups/'

# Run Firefox headless; geckodriver.exe is expected in the working directory.
fire_opts = webdriver.FirefoxOptions()
fire_opts.add_argument("-headless")
fire_path = 'geckodriver.exe'
# Selenium 4 style: the driver path is passed through Service
# (the executable_path argument was removed).
driver = webdriver.Firefox(options=fire_opts, service=Service(fire_path))
driver.get(desired_link)

# find_elements_by_class_name was removed in Selenium 4;
# find_elements(By.CLASS_NAME, ...) is the replacement.
team_names_list = driver.find_elements(By.CLASS_NAME, 'lineups-game__team-name')
team_names = []
for name in team_names_list:
    team_names.append(name.text)

starting_lineup_list = driver.find_elements(By.CLASS_NAME, 'lineups-game__roster--projected')
starting_lineup = []
for lineup in starting_lineup_list:
    starting_lineup.append(lineup.text)

driver.quit()

for teams, players in zip(team_names,starting_lineup):
    print(teams,players)

This should output all the various teams on the page like so:

DET PG D. Rose
SG L. Kennard
SF T. Snell
PF B. Griffin
C A. Drummond

It could probably be formatted a bit better, but you could throw it into a spreadsheet (or whatever you like) to use as you wish.
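One possible refinement, offered as a sketch rather than a tested solution: zip pairs team names and rosters purely by position, and in the question's markup the lineups-game__team-name class appears both in each game's title row and in each team's header, so the pairs can drift out of step. Parsing the rendered HTML one team column at a time sidesteps that. The function name lineups_by_team and the trimmed sample string are illustrative; the class names are taken from the question's HTML, and in practice the input would be something like driver.page_source captured before driver.quit().

```python
from bs4 import BeautifulSoup


def lineups_by_team(rendered_html):
    """Map each team abbreviation to a list of (position, player name) pairs."""
    soup = BeautifulSoup(rendered_html, "html.parser")
    lineups = {}
    # Each team column carries the class lineups-game__team in the question's markup.
    for team_div in soup.find_all("div", class_="lineups-game__team"):
        name_tag = team_div.find("span", class_="lineups-game__team-name")
        if name_tag is None:
            continue  # column without a header: skip rather than guess
        players = []
        for li in team_div.find_all("li", class_="lineups-game__player"):
            pos = li.find("span", class_="lineups-game__pos")
            name = li.find("span", class_="lineups-game__name")
            if name is not None:
                position = pos.get_text(strip=True) if pos else ""
                players.append((position, name.get_text(strip=True)))
        lineups.setdefault(name_tag.get_text(strip=True), []).extend(players)
    return lineups


# Trimmed fragment of the rendered markup from the question (illustrative input).
sample = """
<div class="columns small-6 lineups-game__team lineups-game__team--vtm">
  <div class="lineups-game__header">
    <span class="lineups-game__team-name">MEM</span>
  </div>
  <ul class="lineups-game__roster lineups-game__roster--official">
    <li class="lineups-game__player lineups-game__player--starter">
      <a href="/player/202685/">
        <span class="lineups-game__pos">C</span>
        <span class="lineups-game__name">J. Valanciunas</span>
      </a>
    </li>
    <li class="lineups-game__player lineups-game__player--starter">
      <a href="/player/203109/">
        <span class="lineups-game__pos">SF</span>
        <span class="lineups-game__name">J. Crowder</span>
      </a>
    </li>
  </ul>
</div>
"""

print(lineups_by_team(sample))
# {'MEM': [('C', 'J. Valanciunas'), ('SF', 'J. Crowder')]}
```

Keying the result by team abbreviation keeps each roster attached to its own header, which should make it easier to write out one row per player later.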


Thanks. I was able to modify what you said to get exactly what I wanted. Very helpful! – Sam Skinner, Commented over a year ago

Simas Joneliunas · Accepted Answer · 2019-11-15 02:30:08Z

Unfortunately, you cannot do that with urllib alone. The website in question uses JavaScript to call APIs that populate the data after the initial page load.

urllib can only download the initial file served by the server; it cannot run any of the actions that file performs after its initial render in the browser.

Thus the teams = gamesSoup.find_all("span",{"class":"lineups-game__teams-name"}) call returns an empty list: the HTML you actually download through urllib.request does not yet contain the lineup elements.

You can try examining the API calls that the website makes after the initial load (check the Network tab in your browser's developer tools) and see whether you can find where the data you want comes from. If you are lucky, you might be able to get that data directly through the API call. Because the page makes lots of external requests (for images and other media), you can tick the XHR filter to show only remote API calls in the network list.

If you cannot find the API, or if it blocks external calls, you can instead use a JavaScript-capable browser driven from Python (e.g. Selenium) to load the page and execute its JS code.
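A quick way to tell which situation you are in is to count matching elements in whatever HTML you actually received. The following is a minimal sketch, assuming BeautifulSoup is installed. The helper count_matches and the static_shell string are hypothetical stand-ins (the site's real initial response is not reproduced here), while rendered is copied from the markup in the question.

```python
from bs4 import BeautifulSoup


def count_matches(html, css_class):
    """Count elements in `html` whose class list contains `css_class`."""
    soup = BeautifulSoup(html, "html.parser")
    return len(soup.find_all(class_=css_class))


# Hypothetical stand-in for the initial response: a container with nothing rendered in it yet.
static_shell = '<div class="landing__flex-row lineups-games" ng-show="isLoaded &amp;&amp; hasData"></div>'

# Fragment of the browser-rendered DOM, copied from the question.
rendered = (
    '<span class="lineups-game__team-name">MEM</span>'
    '<span class="lineups-game__team-name">CHA</span>'
)

print(count_matches(static_shell, "lineups-game__team-name"))  # 0
print(count_matches(rendered, "lineups-game__team-name"))      # 2
print(count_matches(rendered, "lineups-game__teams-name"))     # 0
```

The last line is worth noting: the rendered markup in the question uses lineups-game__team-name (singular "team"), so the selector from the question would match nothing even against fully rendered HTML.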

chitown88 · Accepted Answer · 2019-11-18 23:12:25Z

1

You can get it with a call to the API. Just change the date in the URL dynamically. Here's an example. You'll need to either iterate through the games/indexes or flatten the JSON and reconstruct it into a DataFrame:

import pandas as pd
import requests

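url = 'https://stats.nba.com/js/data/dailylineups/2019/daily_lineups_20191118.json'
# Fetch the day's lineup JSON and decode it.
jsonData = requests.get(url).json()

# Take the first entry in 'results' and index it by team abbreviation.
print(pd.DataFrame(jsonData['results'][0]['LAC']))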

Output:

  firstName  lastName playerId pos rotoId team
0   Patrick  Beverley   201976  PG   3072  LAC
1   Terance      Mann  1629611  SG   4860  LAC
2     Kawhi   Leonard   202695  SF   3195  LAC
3      Paul    George   202331  PF   3114  LAC
4     Ivica     Zubac   162726   C   3888  LAC
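The two follow-up steps (changing the date, and flattening every game into one table) could be sketched roughly as below. This rests on assumptions: the URL pattern is inferred from the single example URL above, the year segment is assumed to be the calendar year of the date, and each entry in results is assumed to map team abbreviations to lists of player dicts, as the ['results'][0]['LAC'] lookup suggests. The names lineup_url and flatten_lineups are hypothetical helpers, and the sample data is a trimmed stand-in built from the output above.

```python
from datetime import date

# Assumption: pattern inferred from one example URL; it may not hold for every date or season.
URL_TEMPLATE = "https://stats.nba.com/js/data/dailylineups/{year}/daily_lineups_{stamp}.json"


def lineup_url(day):
    """Build the daily lineups URL for a datetime.date."""
    return URL_TEMPLATE.format(year=day.year, stamp=day.strftime("%Y%m%d"))


def flatten_lineups(json_data):
    """Collect every player dict found under results[i][team] into one flat list of rows."""
    rows = []
    for game_index, game in enumerate(json_data.get("results", [])):
        for value in game.values():
            # Keep only values shaped like a roster: a list of dicts.
            if isinstance(value, list) and all(isinstance(p, dict) for p in value):
                for player in value:
                    rows.append({"game": game_index, **player})
    return rows


# Trimmed stand-in for the real response, using two players from the output above.
sample = {
    "results": [
        {
            "LAC": [
                {"firstName": "Patrick", "lastName": "Beverley", "pos": "PG", "team": "LAC"},
                {"firstName": "Kawhi", "lastName": "Leonard", "pos": "SF", "team": "LAC"},
            ]
        }
    ]
}

print(lineup_url(date(2019, 11, 18)))
# https://stats.nba.com/js/data/dailylineups/2019/daily_lineups_20191118.json
print(len(flatten_lineups(sample)))  # 2
```

Passing the returned rows to pd.DataFrame(rows) would then give a single table with one row per player, with the game column recording which entry of results each player came from.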

