Background

Hi All...new to python and web scraping. I'm on a Mac (Sierra) running Jupyter Notebook in Firefox (87.0). I'm trying to scrape several values from a webpage like this one: https://www.replaypoker.com/tournaments/4337873. One example of a value I'd like to scrape is the Tournament ID.

What I've Tried

I first tried using BeautifulSoup, but the problem is that many of this page's elements are not written into the HTML. They appear to be stored in variables (javascript?) that need to be calculated and then scraped, so the below BeautifulSoup code just spit out the variable name as a string instead of the value.

import requests
from bs4 import BeautifulSoup

url = 'https://www.replaypoker.com/tournaments/4337873'
response = requests.get(url)  # this line was missing; response was never defined
html_soup = BeautifulSoup(response.content, 'html.parser')  # the page is HTML, not XML
tournament_ID = html_soup.find('strong', text='Tournament ID:')
print(tournament_ID.next_sibling.strip())

This returned #{{id}} when I wanted #4337873.

Reading a bit online, I learned that Selenium may address this issue by opening a headless instance of my browser, so I decided to switch and use Selenium. The problem is that I don't know how to get the value of the variable once I find the right element.

from selenium import webdriver
import time

running_tournament_url = 'https://www.replaypoker.com/tournaments/4337873'
driver = webdriver.Firefox(executable_path='/Users/maxwilliams/WebDrivers/geckodriver')
driver.get(running_tournament_url)
assert 'MTT' in driver.title

#tournament_id = driver.find_element_by_css_selector('div.col-xs-6:nth-child(1) > div:nth-child(2) > strong:nth-child(1)')
tournament_id = driver.find_element_by_xpath('/html/body/div[2]/section/div/div[1]/div[1]/div/div[1]/div[2]/strong')
print(tournament_id.text)

seats = driver.find_element_by_class_name('tournaments-seats-per-table')
print(seats.text)
    
time.sleep(3)
driver.quit()

This code spits out Tournament ID: but still not the tournament ID itself. I find this especially confusing because the code for seats above prints Seats Per Table: 9, i.e. both the label and the value.
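For what it's worth, pulling the number out of that seats string is just a split (a sketch using the text printed above):

```python
seats_text = 'Seats Per Table: 9'  # what seats.text returns for me
# split on the colon, take the part after the label, strip whitespace
seats_count = int(seats_text.split(':')[1].strip())
print(seats_count)  # → 9
```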

Questions

  1. Was my decision to use Selenium necessary and correct? Or could this be better accomplished with another library?
  2. How can I scrape the tournament ID value (and others like it)?
  • Congratulations on doing your research before posting on Stack Overflow! I wish more users did this. Commented Mar 30, 2021 at 0:00
  • What other data are you after from the page? You may be able to avoid selenium. Commented Mar 30, 2021 at 1:30
  • Thanks! Using the code from QHarr below as it's exactly what I need and no need for Selenium or a browser. Commented Mar 30, 2021 at 23:07

2 Answers


That data is dynamically pulled from a script tag, meaning you can use requests and re to grab the relevant string and then parse it with json. This avoids the overhead of a browser.

import requests, re, json
import pandas as pd

r = requests.get('https://www.replaypoker.com/tournaments/4337873')
data = json.loads(re.search(r'RP\.data = (.*?);\n+', r.text, flags=re.S).group(1))
print(data['tournament']['id'])
df = pd.DataFrame(data['tournament']['winners'])
df.prizes = df.prizes.apply(lambda x: x[0] if x else '')
print(df)
# print(data) ## other data also present
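To see how that regex behaves, here is a runnable sketch against a minimal mock of the inline script (the sample payload is invented; the real RP.data object on the page is much larger):

```python
import re
import json

# Mock of the page's inline <script> content (invented sample payload).
html = 'window.RP = {};\nRP.data = {"tournament": {"id": 4337873}};\n'

# Lazily capture everything between "RP.data = " and the first ";" at a line end.
match = re.search(r'RP\.data = (.*?);\n+', html, flags=re.S)
data = json.loads(match.group(1))
print(data['tournament']['id'])  # → 4337873
```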



1 Comment

Super helpful! Much simpler than Selenium and you even correctly predicted my next step of scraping player data, which saved me hours of searching.

Using Selenium can definitely help you. You just made a mistake when writing out the XPath (I recommend copying it directly from the "Inspect" pane of Mozilla Firefox). First, you don't need to include strong, as it belongs only to the 'Tournament ID:' label, not to the ID itself; selecting the whole div will work. When I opened the page, the correct XPath was the following:

tournament_id = driver.find_element_by_xpath('/html/body/div[3]/section/div/div[1]/div[1]/div/div[1]/div[2]')

It gives Tournament ID: #4337873 as expected. Getting the ID from this string is easy (just split on # and take the second part).
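For example, the split looks like this (the string is the one printed above):

```python
text = 'Tournament ID: #4337873'
# everything after the "#" is the numeric ID
tournament_id = text.split('#')[1]
print(tournament_id)  # → 4337873
```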

