
I want to scrape data from IMDb. In order to do it for multiple pages I have used the click() method of the Selenium package.

Here is my code:

from bs4 import BeautifulSoup
from selenium import webdriver
import pandas as pd

pages = [str(i) for i in range(10)]

#getting url for each page and year:
url = 'https://www.imdb.com/search/title?release_date=2018&sort=num_votes,desc&page=1'
driver = webdriver.Chrome(r"C:\Users\yefida\Desktop\Study_folder\Online_Courses\The Complete Python Course\Project 2 - Quotes Webscraping\chromedriver.exe")
driver.get(url)
html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')

for page in pages:
    data = soup.find_all('div', class_ = 'lister-item mode-advanced')
    data_list = []
    for item in data:
        temp = {}
        # Name of movie
        temp['movie'] = item.h3.a.text
        # Year
        temp['year'] = item.find('span',{'class':'lister-item-year text-muted unbold'}).text.replace('(','').replace(')','').replace('I','').replace('–','')
        # Runtime in minutes
        temp['time'] = item.find('span',{'class':'runtime'}).text.replace(' min','')
        # Genre
        temp['genre'] = item.find('span',{'class':'genre'}).text.replace(' ','').replace('\n','')
        # Rating of users
        temp['raiting'] = item.find('div',{'class':'inline-block ratings-imdb-rating'}).text.replace('\n','').replace(',','.')
        # Metascore
        try:
            temp['metascore'] = item.find('div',{'class':'inline-block ratings-metascore'}).text.replace('\n','').replace('Metascore','').replace(' ','')
        except:
            temp['metascore'] = None
        data_list.append(temp)

    #next page
    continue_link = driver.find_element_by_link_text('Next')
    continue_link.click()

At the end I am getting an error:

'Message: no such element: Unable to locate element: {"method":"link text","selector":"Next"}
  (Session info: chrome=70.0.3538.102)
'

Can you help me to correct it?

3 Answers


Following the logic below, you can update your soup element with each new page's content. I used the xpath '//a[contains(.,"Next")]' to click the next-page button. The script should keep clicking that button until there is no more button to click, and then break out of the loop. Give it a go:

from selenium import webdriver
from bs4 import BeautifulSoup

url = 'https://www.imdb.com/search/title?release_date=2018&sort=num_votes,desc&page=1'

driver = webdriver.Chrome()
driver.get(url)
soup = BeautifulSoup(driver.page_source,"lxml")

while True:
    items = [itm.get_text(strip=True) for itm in soup.select('.lister-item-content a[href^="/title/"]')]
    print(items)

    try:
        driver.find_element_by_xpath('//a[contains(.,"Next")]').click()
        soup = BeautifulSoup(driver.page_source,"lxml")
    except Exception:
        break
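If you also want the per-movie fields from the question inside that loop, the parsing itself can be tested without a browser against a static fragment. The HTML below is a simplified, hypothetical stand-in for IMDb's lister items, not the site's real markup:

```python
import re
from bs4 import BeautifulSoup

# Simplified stand-in for one IMDb lister item (hypothetical markup).
html = """
<div class="lister-item mode-advanced">
  <div class="lister-item-content">
    <h3><a href="/title/tt0000001/">Some Movie</a>
      <span class="lister-item-year text-muted unbold">(I) (2018)</span></h3>
    <span class="runtime">136 min</span>
    <span class="genre">Action, Drama</span>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
item = soup.find("div", class_="lister-item mode-advanced")

movie = item.h3.a.text
# A regex is sturdier than the question's chained .replace() calls:
# it pulls the four-digit year out of variants like "(I) (2018)".
year = re.search(r"\d{4}", item.find("span", class_="lister-item-year").text).group()
runtime = item.find("span", class_="runtime").text.replace(" min", "")

print(movie, year, runtime)
```

The same extraction lines then drop straight into the while loop above, running once per page before the click.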

2 Comments

If you don't have lxml installed on your machine, try using html.parser instead within BeautifulSoup().
Thank you very much! It helped me a lot :)

That's because the link text is actually "Next »", so try either

continue_link = driver.find_element_by_link_text('Next »')

or

continue_link = driver.find_element_by_partial_link_text('Next')

3 Comments

@DY92 , did you get an exception?
No, just data from the first page
@DY92 , that's because you're using BeautifulSoup, which keeps parsing the same page source on every iteration. You don't need BeautifulSoup here - try common Selenium methods and properties
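To see the point about re-parsing: a BeautifulSoup object is a snapshot of whatever string it was built from, so after each click() you must rebuild it from driver.page_source. A browser-free sketch of the same idea, using two static strings as stand-ins for the page source before and after a click:

```python
from bs4 import BeautifulSoup

# Two "pages", standing in for driver.page_source before and after a click.
page_1 = '<div class="lister-item-content"><h3><a>Movie A</a></h3></div>'
page_2 = '<div class="lister-item-content"><h3><a>Movie B</a></h3></div>'

soup = BeautifulSoup(page_1, "html.parser")
first = soup.h3.a.text   # "Movie A"

# Without this rebuild, the old soup would still return "Movie A".
soup = BeautifulSoup(page_2, "html.parser")
second = soup.h3.a.text  # "Movie B"

print(first, second)
```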

You could also use a CSS selector to target the class of the next button:

driver.find_element_by_css_selector('.lister-page-next.next-page').click()

This class is consistent across pages. You could add a wait for the element to be clickable:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.CSS_SELECTOR, '.lister-page-next.next-page')))

My understanding is that a CSS selector should be a fast matching method.

2 Comments

Thank you! It works, but not exactly what I want: It scrapes only the first page out of 10.
The class remains the same across pages, so this should work if used on each new page. You could add a wait for the element to become clickable.
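If clicking keeps giving you only the first page, an alternative is to skip the button entirely and build each page's URL yourself - the question's URL already carries a page query parameter. A sketch using only the standard library (the parameter name is taken from that URL, not verified against any IMDb documentation):

```python
from urllib.parse import urlencode, urlparse, parse_qs, urlunparse

def page_url(url, page):
    """Return `url` with its `page` query parameter set to `page`."""
    parts = urlparse(url)
    query = parse_qs(parts.query)
    query["page"] = [str(page)]
    return urlunparse(parts._replace(query=urlencode(query, doseq=True)))

base = "https://www.imdb.com/search/title?release_date=2018&sort=num_votes,desc&page=1"
urls = [page_url(base, n) for n in range(1, 11)]  # pages 1 through 10
print(urls[1])
```

You can then driver.get() each URL in turn and parse driver.page_source per page, with no Next-button handling at all.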
