
I'm trying to scrape all 5,000 companies from this page. It's a dynamic page: companies are loaded as I scroll down, but I can only scrape 5 of them. So how can I scrape all 5,000? The URL changes as I scroll down the page. I tried Selenium, but it's not working. https://www.inc.com/profile/onetrust Note: I want to scrape all of the companies' info, but for now I have selected just two fields.

import time
from bs4 import BeautifulSoup as soup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

my_url = 'https://www.inc.com/profile/onetrust'

options = Options()
driver = webdriver.Chrome(options=options)  # chrome_options is deprecated
driver.get(my_url)
time.sleep(3)
page = driver.page_source
driver.quit()

# Parse the page source rendered by Selenium; re-downloading the URL with
# urllib would fetch only the static HTML, without the dynamically loaded rows.
page_soup = soup(page, "html.parser")

containers = page_soup.find_all("div", class_="sc-prOVx cTseUq company-profile")

for container in containers:
    rank = container.h2.get_text()
    company_name_1 = container.find_all("h2", class_="sc-AxgMl LXebc h2")
    Company_name = company_name_1[0].get_text()


    print("rank :" + rank)
    print("Company_name :" + Company_name)

UPDATED CODE, BUT THE PAGE IS NOT SCROLLING AT ALL. I corrected some mistakes in the BeautifulSoup code.

import time
from bs4 import BeautifulSoup as soup
from selenium import webdriver

my_url = 'https://www.inc.com/profile/onetrust'

driver = webdriver.Chrome()
driver.get(my_url)


def scroll_down(self):
    """A method for scrolling the page."""

    # Get scroll height.
    last_height = self.driver.execute_script("return document.body.scrollHeight")

    while True:

        # Scroll down to the bottom.
        self.driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

        # Wait to load the page.
        time.sleep(2)

        # Calculate new scroll height and compare with last scroll height.
        new_height = self.driver.execute_script("return document.body.scrollHeight")

        if new_height == last_height:

            break

        last_height = new_height


page_soup = soup(driver.page_source, "html.parser")

containers = page_soup.find_all("div", class_="sc-prOVx cTseUq company-profile")

for container in containers:
    rank = container.h2.get_text()
    company_name_1 = container.find_all("h2", class_="sc-AxgMl LXebc h2")
    Company_name = company_name_1[0].get_text()


    print("rank :" + rank)
    print("Company_name :" + Company_name)

Thank you for reading!

  • You could scroll to the end of the page, e.g. as shown here: stackoverflow.com/a/48851166/2776376 (see the sketch after these comments), or you could use the API of the page you are trying to scrape, e.g. inc.com/rest/companyprofile/leadcrunch/withlist Commented Nov 5, 2020 at 22:07
  • Thanks, I will try both. May I ask how you found the API of that page? Commented Nov 6, 2020 at 2:56
  • When you open the page in a browser, you can inspect the network calls that are made in the Developer Tools section. Commented Nov 6, 2020 at 8:10
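
For reference, here is a minimal sketch of the scroll-to-bottom approach from the first comment, built on the question's own updated code but with the scroll loop actually executed before parsing. Matching on the single class company-profile is an assumption (BeautifulSoup matches any element carrying that class among others), since the auto-generated sc-... class names can change between deployments:

    import time
    from bs4 import BeautifulSoup as soup
    from selenium import webdriver

    driver = webdriver.Chrome()
    driver.get('https://www.inc.com/profile/onetrust')

    # Keep scrolling until the page height stops growing, i.e. nothing new loads.
    last_height = driver.execute_script("return document.body.scrollHeight")
    while True:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(2)  # crude fixed wait; an explicit wait for new rows is more robust
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break
        last_height = new_height

    # Parse only after scrolling has finished, so every loaded company is in the DOM.
    page_soup = soup(driver.page_source, "html.parser")
    driver.quit()

    for container in page_soup.find_all("div", class_="company-profile"):
        print(container.h2.get_text())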

1 Answer


Try the approach below using Python requests: it is simple, straightforward, reliable, and fast, and it requires less code. I fetched the API URL from the website itself by inspecting the network section of the Google Chrome developer tools.

What exactly the script below does:

  1. First it takes the API URL and makes a GET request.

  2. It then parses the JSON response using json.loads.

  3. Finally, it iterates over the list of companies and prints each one's fields, e.g. rank, company name, social account links, CEO name, etc.

    import json
    import requests
    from urllib3.exceptions import InsecureRequestWarning

    # Suppress the warnings that verify=False triggers below.
    requests.packages.urllib3.disable_warnings(InsecureRequestWarning)

    def scrap_inc_5000():
        URL = 'https://www.inc.com/rest/companyprofile/nuleaf-naturals/withlist'

        response = requests.get(URL, verify=False)
        result = json.loads(response.text)  # Parse the result using json.loads
        extracted_data = result['fullList']['listCompanies']
        for data in extracted_data:
            print('-' * 100)
            print('Rank : ', data['rank'])
            print('Company : ', data['company'])
            print('Icon : ', data['icon'])
            print('CEO Name : ', data['ifc_ceo_name'])
            print('Facebook Address : ', data['ifc_facebook_address'])
            print('File Location : ', data['ifc_filelocation'])
            print('Linkedin Address : ', data['ifc_linkedin_address'])
            print('Twitter Handle : ', data['ifc_twitter_handle'])
            print('Secondary Link : ', data['secondary_link'])
            print('-' * 100)

    scrap_inc_5000()
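
Since the question mentions wanting all of each company's info rather than just two fields, a small variation of the same request can dump every field the API returns to a CSV file. This is only a sketch: the output file name inc5000.csv is hypothetical, and the column set is simply whatever keys the endpoint happens to return.

    import csv
    import requests

    URL = 'https://www.inc.com/rest/companyprofile/nuleaf-naturals/withlist'

    response = requests.get(URL, verify=False)
    companies = response.json()['fullList']['listCompanies']

    # Use the union of all keys as columns, since fields can vary per record;
    # DictWriter fills any missing values with an empty string.
    fieldnames = sorted({key for company in companies for key in company})

    with open('inc5000.csv', 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(companies)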
    

1 Comment

Thank you so much. It works! Although I was after the company's website, I see that that data is not available in the API JSON file, which is strange. Do you know why that happens even though the data is available on the webpage?
