Scraping content with Python and Selenium

Question

I would like to extract all the league names (e.g. England Premier League, Scotland Premiership, etc.) from this website https://mobile.bet365.com/#type=Splash;key=1;ip=0;lng=1

Using the inspector tools in Chrome/Firefox, I can see that they are located in elements like this:

<span>England Premier League</span>

So I tried this

from lxml import html

from selenium import webdriver

session = webdriver.Firefox()
url = 'https://mobile.bet365.com/#type=Splash;key=1;ip=0;lng=1'
session.get(url)
tree = html.fromstring(session.page_source)
leagues = tree.xpath('//span/text()')
print(leagues)

Unfortunately this doesn't return the desired results :-(

To me it looks like the website has different frames and I'm extracting the content from the wrong frame.

Could anyone please help me out here or point me in the right direction? As an alternative, if someone knows how to extract the information through their API, that would obviously be the superior solution.

Any help is much appreciated. Thank you!

Try to import requests and then parse tree = html.fromstring(requests.get("https://mobile.bet365.com/V6/sport/splash/splash.aspx?zone=0&isocode=RO&tzi=4&key=1&gn=0&cid=1&lng=1&ctg=1&ct=156&clt=8881&ot=2").content)
– Andersson, Commented Sep 20, 2017 at 10:06
Thank you for your suggestion. Unfortunately this is not a solution for me, as I need to emulate a real browser session and therefore need to use selenium (requests will not work, and any attempt to scrape the content with this library results in an IP block from bet365). I also tried your URL with selenium, which returns an empty list.
– Baili, Commented Sep 20, 2017 at 10:51
Sometimes when you copy/paste something from SO it might contain hidden characters. The URL provided in my comment seems to be OK, but it breaks when copied... You can check the same URL in my answer. Also check the answer itself: it returns the desired output without extra text, and there is no need to use time.sleep() or BeautifulSoup.
– Andersson, Commented Sep 20, 2017 at 17:31

thebadguy · Accepted Answer · 2017-09-20 11:02:50Z

2

Hope you are looking for something like this:

from selenium import webdriver
import bs4
import time

driver = webdriver.Chrome()
url = 'https://mobile.bet365.com/#type=Splash;key=1;ip=0;lng=1'

driver.get(url)
driver.maximize_window()
# Fixed pause so that JavaScript has time to populate the page
time.sleep(10)
page_source = driver.page_source

soup = bs4.BeautifulSoup(page_source, "html.parser")

# Print the text of every span inside each eventWrapper div
for data in soup.find_all('div', {'class': 'eventWrapper'}):
    for res in data.find_all('span'):
        print(res.text)

It will print the below data:

Wednesday's Matches
International List
Elite Euro List
UK List
Australia List
Club Friendly List
England Premier League
England EFL Cup
England Championship
England League 1
England League 2
England National League
England National League North
England National League South
Scotland Premiership
Scotland League Cup
Scotland Championship
Scotland League One
Scotland League Two
Northern Ireland Reserve League
Scotland Development League East
Wales Premier League
Wales Cymru Alliance
Asia - World Cup Qualifying
UEFA Champions League
UEFA Europa League
Wednesday's Matches
International List
Elite Euro List
UK List
Australia List
Club Friendly List
England Premier League
England EFL Cup
England Championship
England League 1
England League 2
England National League
England National League North
England National League South
Scotland Premiership
Scotland League Cup
Scotland Championship
Scotland League One
Scotland League Two
Northern Ireland Reserve League
Scotland Development League East
Wales Premier League
Wales Cymru Alliance
Asia - World Cup Qualifying
UEFA Champions League
UEFA Europa League

The only problem is that it prints the result set twice.
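If the repeated block is unwanted, one option is to collect the span texts in a list and drop duplicates while keeping the original order. The following is a minimal sketch of that idea; the helper name unique_in_order and the sample list are hypothetical and only stand in for the scraped values.

```python
def unique_in_order(names):
    """Return names with duplicates removed, keeping first-seen order."""
    # dict keys are unique and (since Python 3.7) keep insertion order
    return list(dict.fromkeys(names))


# Hypothetical sample standing in for the scraped span texts
scraped = [
    "England Premier League",
    "Scotland Premiership",
    "England Premier League",
    "Scotland Premiership",
]

leagues = unique_in_order(scraped)
print(leagues)
```

In the answer's loop this would mean appending res.text to a list instead of printing it, then passing that list to the helper.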

answered Sep 20, 2017 at 11:02

thebadguy


3 Comments

Baili Over a year ago

Absolutely phenomenal, works perfectly! Many many thanx, printing the results twice is not a problem at all.

Baili Over a year ago

In fact adding time.sleep(10) to my script also works. Thank you for pointing out this essential part. JS obviously needs some time to populate the data!

JeffC Over a year ago

The problem is that you are scraping every SPAN on the page, which results in too many results... results that you don't want. If you change it to the CSS selector div.podSplashRow :not(.empty), you will get the list only once. You will still get the names of the lists at the top of the page, but I don't see a way to programmatically remove those at first glance.
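A rough sketch of how the selector from this comment might be applied with Selenium follows. The selector is quoted from the comment and was not verified against the live page; league_names is a hypothetical helper, and the locator strategy is passed as the plain string "css selector", which is the value of By.CSS_SELECTOR, so the function works with any object exposing Selenium's find_elements(by, value).

```python
# Value of selenium.webdriver.common.by.By.CSS_SELECTOR
CSS_SELECTOR = "css selector"

# Selector quoted from the comment above; not verified against the live page
LEAGUE_SELECTOR = "div.podSplashRow :not(.empty)"


def league_names(driver, selector=LEAGUE_SELECTOR):
    """Collect non-empty text from every element matching the selector.

    `driver` is any object with Selenium's find_elements(by, value) method,
    for example a webdriver.Chrome() instance after the page has loaded.
    """
    elements = driver.find_elements(CSS_SELECTOR, selector)
    return [el.text.strip() for el in elements if el.text.strip()]
```

With the driver from the answer above, the call would be league_names(driver), made after the page has had time to load.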

Andersson · Accepted Answer · 2017-09-21 04:56:29Z

1

The required content is absent from the initial page source. It is loaded dynamically from https://mobile.bet365.com/V6/sport/splash/splash.aspx?zone=0&isocode=RO&tzi=4&key=1&gn=0&cid=1&lng=1&ctg=1&ct=156&clt=8881&ot=2
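One way to see why the initial source is empty is to split the two URLs with the standard library. In the page URL everything after # is a fragment, which browsers do not send to the server, so the league data has to be fetched by JavaScript afterwards. The sketch below only inspects the URLs; it makes no request, and the meaning of the individual query parameters is not known here.

```python
from urllib.parse import urlsplit, parse_qs

page_url = "https://mobile.bet365.com/#type=Splash;key=1;ip=0;lng=1"
data_url = ("https://mobile.bet365.com/V6/sport/splash/splash.aspx"
            "?zone=0&isocode=RO&tzi=4&key=1&gn=0&cid=1&lng=1"
            "&ctg=1&ct=156&clt=8881&ot=2")

page = urlsplit(page_url)
# Everything after '#' is the fragment; it stays in the browser
print(page.path, repr(page.query), page.fragment)

data = urlsplit(data_url)
# The data URL carries its parameters in a regular query string
params = parse_qs(data.query)
print(data.path)
print(sorted(params))
```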

To get this content you can use an explicit wait, as below:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

session = webdriver.Firefox()
url = 'https://mobile.bet365.com/#type=Splash;key=1;ip=0;lng=1'
session.get(url)
# Wait up to 10 seconds for the dynamically loaded container to appear
WebDriverWait(session, 10).until(EC.presence_of_element_located((By.ID, 'Splash')))

# Expand every collapsed section so that its leagues are rendered
for collapsed in session.find_elements(By.XPATH, '//h3[contains(@class, "collapsed")]'):
    collapsed.location_once_scrolled_into_view  # accessing this property scrolls the element into view
    collapsed.click()

for event in session.find_elements(By.XPATH, '//div[contains(@class, "eventWrapper")]//span'):
    print(event.text)
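For readers new to explicit waits, the idea behind WebDriverWait(...).until(...) is to poll a condition repeatedly until it returns a truthy value or a timeout expires, rather than sleeping for a fixed time. The sketch below illustrates that idea with the standard library only. It is a simplified stand-in and not Selenium's actual implementation; wait_until, its parameters, and the fake condition are hypothetical.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Poll condition() until it returns a truthy value, then return it.

    Raises TimeoutError if nothing truthy is returned within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f s" % timeout)
        time.sleep(poll_interval)


# Hypothetical condition that becomes true on the third poll,
# standing in for "the element with id 'Splash' is present"
attempts = {"count": 0}


def splash_present():
    attempts["count"] += 1
    return "Splash" if attempts["count"] >= 3 else None


found = wait_until(splash_present, timeout=2.0, poll_interval=0.01)
print(found, attempts["count"])
```

Compared with time.sleep(10), this returns as soon as the condition holds and fails loudly when it never does.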

edited Sep 21, 2017 at 4:56

answered Sep 20, 2017 at 11:19

Andersson


7 Comments

JeffC Over a year ago

Your locator is only returning the UK leagues... there are other leagues further down the page.

Andersson Over a year ago

Yep. It's not quite clear which elements OP actually wants to get as, for example, "Wednesday's Matches" should not be included in all the league names as it's obviously not a League name...

Baili Over a year ago

I would be interested in all leagues. It seems your locator only returns leagues that are not collapsed. The "Wednesday's Matches" is not a problem and can be included.

JeffC Over a year ago

@Baili Any locator is going to return only the leagues that aren't collapsed because they are the only ones that are visible. It's not clear in your question which names you wanted.

Baili Over a year ago

@JeffC Thank you. I would need all league names, also those that are collapsed by default when visiting the website. E.g. also Italy Serie A, Italy Serie B, Spain Primera Liga,...
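One possible approach for hidden rows, sketched below, relies on the fact that Selenium's WebElement.text only reports rendered text, while the DOM textContent property can be read regardless of visibility. This assumes the collapsed leagues are already present in the DOM and merely hidden, which was not verified for this site; if they are only loaded after a click, the sections need to be expanded first. The helper name text_including_hidden is hypothetical.

```python
def text_including_hidden(element):
    """Return an element's text even when the element is not displayed.

    WebElement.text yields "" for elements that are not rendered, so the
    DOM textContent property is read as a fallback in that case.
    """
    visible = element.text
    if visible:
        return visible.strip()
    raw = element.get_attribute("textContent") or ""
    return raw.strip()
```

It would be applied to each element returned by a find_elements call, for example the eventWrapper spans used in the answer above.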
