Background

Hi All...new to python and web scraping. I'm on a Mac (Sierra) running Jupyter Notebook in Firefox (87.0). I'm trying to scrape several values from a webpage like this one: https://www.replaypoker.com/tournaments/4337873. One example of a value I'd like to scrape is the Tournament ID.

What I've Tried

I first tried using BeautifulSoup, but the problem is that many of this page's elements are not written into the HTML. They appear to be stored in variables (javascript?) that need to be calculated and then scraped, so the below BeautifulSoup code just spit out the variable name as a string instead of the value.

import requests
from bs4 import BeautifulSoup

url = 'https://www.replaypoker.com/tournaments/4337873'
response = requests.get(url)  # this line was missing; response was never defined
html_soup = BeautifulSoup(response.content, 'html.parser')  # the page is HTML, not XML
tournament_ID = html_soup.find('strong', text='Tournament ID:')
print(tournament_ID.next_sibling.strip())

This returned #{{id}} when I wanted #4337873.

Reading a bit online, I learned that Selenium may address this issue by opening a headless instance of my browser, so I decided to switch and use Selenium. The problem is that I don't know how to get the value of the variable once I find the right element.

from selenium import webdriver
import time

running_tournament_url = 'https://www.replaypoker.com/tournaments/4337873'
driver = webdriver.Firefox(executable_path='/Users/maxwilliams/WebDrivers/geckodriver')
driver.get(running_tournament_url)
assert 'MTT' in driver.title

#tournament_id = driver.find_element_by_css_selector('div.col-xs-6:nth-child(1) > div:nth-child(2) > strong:nth-child(1)')
tournament_id = driver.find_element_by_xpath('/html/body/div[2]/section/div/div[1]/div[1]/div/div[1]/div[2]/strong')
print(tournament_id.text)

seats = driver.find_element_by_class_name('tournaments-seats-per-table')
print(seats.text)
    
time.sleep(3)
driver.quit()

This code spits out Tournament ID: but still not the tournament ID itself. I find this especially confusing because the code for seats above prints Seats Per Table: 9, i.e. both the label and the value.
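For what it's worth, pulling the number out of that seats string is just a split (a sketch using the text printed above):

```python
seats_text = 'Seats Per Table: 9'  # what seats.text returns for me
# split on the colon, take the part after the label, strip whitespace
seats_count = int(seats_text.split(':')[1].strip())
print(seats_count)  # → 9
```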

Questions

  1. Was my decision to use Selenium necessary and correct? Or could this be better accomplished with another library?
  2. How can I scrape the tournament ID value (and others like it)?
  • Congratulations on doing your research before posting on Stack Overflow! I wish more users did this. Commented Mar 30, 2021 at 0:00
  • What other data are you after from the page? You may be able to avoid selenium. Commented Mar 30, 2021 at 1:30
  • Thanks! Using the code from QHarr below as it's exactly what I need and no need for Selenium or a browser. Commented Mar 30, 2021 at 23:07

2 Answers


That data is dynamically pulled from a script tag, meaning you can use requests and re to grab the relevant string and then parse it with json. This avoids the overhead of a browser.

import requests, re, json
import pandas as pd

r = requests.get('https://www.replaypoker.com/tournaments/4337873')
data = json.loads(re.search(r'RP\.data = (.*?);\n+', r.text, flags=re.S).group(1))
print(data['tournament']['id'])
df = pd.DataFrame(data['tournament']['winners'])
df.prizes = df.prizes.apply(lambda x: x[0] if x else '')
print(df)
# print(data) ## other data also present
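To see how that regex behaves, here is a runnable sketch against a minimal mock of the inline script (the sample payload is invented; the real RP.data object on the page is much larger):

```python
import re
import json

# Mock of the page's inline <script> content (invented sample payload).
html = 'window.RP = {};\nRP.data = {"tournament": {"id": 4337873}};\n'

# Lazily capture everything between "RP.data = " and the first ";" at a line end.
match = re.search(r'RP\.data = (.*?);\n+', html, flags=re.S)
data = json.loads(match.group(1))
print(data['tournament']['id'])  # → 4337873
```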



1 Comment

Super helpful! Much simpler than Selenium and you even correctly predicted my next step of scraping player data, which saved me hours of searching.

Using Selenium can definitely help you. You just made a mistake when writing out the XPath (I recommend copying it directly from the "Inspect" pane of Mozilla Firefox). First, you don't need to include strong, as it belongs only to the 'Tournament ID:' label, not to the ID itself; selecting the whole div will work. When I opened the page, the correct XPath was the following:

tournament_id = driver.find_element_by_xpath('/html/body/div[3]/section/div/div[1]/div[1]/div/div[1]/div[2]')

It gives Tournament ID: #4337873 as expected. Getting the ID from this string is easy (just split on # and take the second part).
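For example, the split looks like this (the string is the one printed above):

```python
text = 'Tournament ID: #4337873'
# everything after the "#" is the numeric ID
tournament_id = text.split('#')[1]
print(tournament_id)  # → 4337873
```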

