
I want to scrape data from IMDb. In order to do it for multiple pages I have used the click() method of the Selenium package.

Here is my code:

from bs4 import BeautifulSoup
from selenium import webdriver
import pandas as pd

pages = [str(i) for i in range(10)]

#getting url for each page and year:
url = 'https://www.imdb.com/search/title?release_date=2018&sort=num_votes,desc&page=1'
driver = webdriver.Chrome(r"C:\Users\yefida\Desktop\Study_folder\Online_Courses\The Complete Python Course\Project 2 - Quotes Webscraping\chromedriver.exe")
driver.get(url)
html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')

for page in pages:
    data = soup.find_all('div', class_ = 'lister-item mode-advanced')
    data_list = []
    for item in data:
        temp = {}
        # Name of movie
        temp['movie'] = item.h3.a.text
        # Year
        temp['year'] = item.find('span',{'class':'lister-item-year text-muted unbold'}).text.replace('(','').replace(')','').replace('I','').replace('–','')
        # Runtime in minutes
        temp['time'] = item.find('span',{'class':'runtime'}).text.replace(' min','')
        # Genre
        temp['genre'] = item.find('span',{'class':'genre'}).text.replace(' ','').replace('\n','')
        # Rating of users
        temp['raiting'] = item.find('div',{'class':'inline-block ratings-imdb-rating'}).text.replace('\n','').replace(',','.')
        # Metascore
        try:
            temp['metascore'] = item.find('div',{'class':'inline-block ratings-metascore'}).text.replace('\n','').replace('Metascore','').replace(' ','')
        except:
            temp['metascore'] = None
        data_list.append(temp)

    #next page
    continue_link = driver.find_element_by_link_text('Next')
    continue_link.click()

At the end I am getting an error:

'Message: no such element: Unable to locate element: {"method":"link text","selector":"Next"}
  (Session info: chrome=70.0.3538.102)
'

Can you help me to correct it?

3 Answers


Following the logic below, you can update your soup element with each new page's content. I used the xpath '//a[contains(.,"Next")]' to click the next-page button. The script should keep clicking that button until there is no more button to click, and then break out of the loop. Give it a go:

from selenium import webdriver
from bs4 import BeautifulSoup

url = 'https://www.imdb.com/search/title?release_date=2018&sort=num_votes,desc&page=1'

driver = webdriver.Chrome()
driver.get(url)
soup = BeautifulSoup(driver.page_source,"lxml")

while True:
    items = [itm.get_text(strip=True) for itm in soup.select('.lister-item-content a[href^="/title/"]')]
    print(items)

    try:
        driver.find_element_by_xpath('//a[contains(.,"Next")]').click()
        soup = BeautifulSoup(driver.page_source,"lxml")
    except Exception:
        break
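If you also want the per-movie fields from the question inside that loop, the parsing itself can be tested without a browser against a static fragment. The HTML below is a simplified, hypothetical stand-in for IMDb's lister items, not the site's real markup:

```python
import re
from bs4 import BeautifulSoup

# Simplified stand-in for one IMDb lister item (hypothetical markup).
html = """
<div class="lister-item mode-advanced">
  <div class="lister-item-content">
    <h3><a href="/title/tt0000001/">Some Movie</a>
      <span class="lister-item-year text-muted unbold">(I) (2018)</span></h3>
    <span class="runtime">136 min</span>
    <span class="genre">Action, Drama</span>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
item = soup.find("div", class_="lister-item mode-advanced")

movie = item.h3.a.text
# A regex is sturdier than the question's chained .replace() calls:
# it pulls the four-digit year out of variants like "(I) (2018)".
year = re.search(r"\d{4}", item.find("span", class_="lister-item-year").text).group()
runtime = item.find("span", class_="runtime").text.replace(" min", "")

print(movie, year, runtime)
```

The same extraction lines then drop straight into the while loop above, running once per page before the click.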

2 Comments

If you don't have lxml installed on your machine, try using html.parser instead within BeautifulSoup().
Thank you very much! It helped me a lot :)

That's because the link text is actually "Next »", so try either

continue_link = driver.find_element_by_link_text('Next »')

or

continue_link = driver.find_element_by_partial_link_text('Next')

3 Comments

@DY92 , did you get an exception?
No, just data from the first page
@DY92 , that's because you're using BeautifulSoup, which keeps parsing the same page source on every iteration. You don't need BeautifulSoup here - try common Selenium methods and properties
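To see the point about re-parsing: a BeautifulSoup object is a snapshot of whatever string it was built from, so after each click() you must rebuild it from driver.page_source. A browser-free sketch of the same idea, using two static strings as stand-ins for the page source before and after a click:

```python
from bs4 import BeautifulSoup

# Two "pages", standing in for driver.page_source before and after a click.
page_1 = '<div class="lister-item-content"><h3><a>Movie A</a></h3></div>'
page_2 = '<div class="lister-item-content"><h3><a>Movie B</a></h3></div>'

soup = BeautifulSoup(page_1, "html.parser")
first = soup.h3.a.text   # "Movie A"

# Without this rebuild, the old soup would still return "Movie A".
soup = BeautifulSoup(page_2, "html.parser")
second = soup.h3.a.text  # "Movie B"

print(first, second)
```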

You could also use a CSS selector to target the class of the next button:

driver.find_element_by_css_selector('.lister-page-next.next-page').click()

This class is consistent across pages. You could add a wait for the element to be clickable:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.CSS_SELECTOR, '.lister-page-next.next-page')))

My understanding is that a CSS selector should be a fast matching method.

2 Comments

Thank you! It works, but not exactly what I want: It scrapes only the first page out of 10.
The class remains the same across pages, so this should work if used on each new page. You could add a wait for the element to become clickable.
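If clicking keeps giving you only the first page, an alternative is to skip the button entirely and build each page's URL yourself - the question's URL already carries a page query parameter. A sketch using only the standard library (the parameter name is taken from that URL, not verified against any IMDb documentation):

```python
from urllib.parse import urlencode, urlparse, parse_qs, urlunparse

def page_url(url, page):
    """Return `url` with its `page` query parameter set to `page`."""
    parts = urlparse(url)
    query = parse_qs(parts.query)
    query["page"] = [str(page)]
    return urlunparse(parts._replace(query=urlencode(query, doseq=True)))

base = "https://www.imdb.com/search/title?release_date=2018&sort=num_votes,desc&page=1"
urls = [page_url(base, n) for n in range(1, 11)]  # pages 1 through 10
print(urls[1])
```

You can then driver.get() each URL in turn and parse driver.page_source per page, with no Next-button handling at all.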
