
I'm trying to scrape all 5,000 companies from this page. It's a dynamic page: companies are loaded as I scroll down, but I can only scrape 5 of them. So how can I scrape all 5,000? The URL changes as I scroll down the page. I tried Selenium, but it's not working. https://www.inc.com/profile/onetrust Note: I want to scrape all of the companies' info, but for now I have selected just two fields.

import time
from bs4 import BeautifulSoup as soup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

my_url = 'https://www.inc.com/profile/onetrust'

options = Options()
driver = webdriver.Chrome(options=options)  # chrome_options is deprecated
driver.get(my_url)
time.sleep(3)
page = driver.page_source
driver.quit()

# Parse the page source rendered by Selenium; re-downloading the URL with
# urllib would fetch only the static HTML, without the dynamically loaded rows.
page_soup = soup(page, "html.parser")

containers = page_soup.find_all("div", class_="sc-prOVx cTseUq company-profile")

for container in containers:
    rank = container.h2.get_text()
    company_name_1 = container.find_all("h2", class_="sc-AxgMl LXebc h2")
    Company_name = company_name_1[0].get_text()


    print("rank :" + rank)
    print("Company_name :" + Company_name)

UPDATED CODE, BUT THE PAGE IS NOT SCROLLING AT ALL. I corrected some mistakes in the BeautifulSoup code.

import time
from bs4 import BeautifulSoup as soup
from selenium import webdriver

my_url = 'https://www.inc.com/profile/onetrust'

driver = webdriver.Chrome()
driver.get(my_url)


def scroll_down(self):
    """A method for scrolling the page."""

    # Get scroll height.
    last_height = self.driver.execute_script("return document.body.scrollHeight")

    while True:

        # Scroll down to the bottom.
        self.driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

        # Wait to load the page.
        time.sleep(2)

        # Calculate new scroll height and compare with last scroll height.
        new_height = self.driver.execute_script("return document.body.scrollHeight")

        if new_height == last_height:

            break

        last_height = new_height


page_soup = soup(driver.page_source, "html.parser")

containers = page_soup.find_all("div", class_="sc-prOVx cTseUq company-profile")

for container in containers:
    rank = container.h2.get_text()
    company_name_1 = container.find_all("h2", class_="sc-AxgMl LXebc h2")
    Company_name = company_name_1[0].get_text()


    print("rank :" + rank)
    print("Company_name :" + Company_name)

Thank you for reading!

  • You could scroll to the end of the page, e.g. as shown here: stackoverflow.com/a/48851166/2776376 (see the sketch after these comments), or you could use the API of the page you are trying to scrape, e.g. inc.com/rest/companyprofile/leadcrunch/withlist Commented Nov 5, 2020 at 22:07
  • Thanks, I will try both. May I ask how you found the API of that page? Commented Nov 6, 2020 at 2:56
  • When you open the page in a browser, you can inspect the network calls that are made in the Developer Tools section. Commented Nov 6, 2020 at 8:10
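
For reference, here is a minimal sketch of the scroll-to-bottom approach from the first comment, built on the question's own updated code but with the scroll loop actually executed before parsing. Matching on the single class company-profile is an assumption (BeautifulSoup matches any element carrying that class among others), since the auto-generated sc-... class names can change between deployments:

    import time
    from bs4 import BeautifulSoup as soup
    from selenium import webdriver

    driver = webdriver.Chrome()
    driver.get('https://www.inc.com/profile/onetrust')

    # Keep scrolling until the page height stops growing, i.e. nothing new loads.
    last_height = driver.execute_script("return document.body.scrollHeight")
    while True:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(2)  # crude fixed wait; an explicit wait for new rows is more robust
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break
        last_height = new_height

    # Parse only after scrolling has finished, so every loaded company is in the DOM.
    page_soup = soup(driver.page_source, "html.parser")
    driver.quit()

    for container in page_soup.find_all("div", class_="company-profile"):
        print(container.h2.get_text())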

1 Answer


Try the approach below using Python requests: it is simple, straightforward, reliable, and fast, and it requires less code. I fetched the API URL from the website itself by inspecting the network section of the Google Chrome developer tools.

What exactly the script below does:

  1. First it takes the API URL and makes a GET request.

  2. It then parses the JSON response using json.loads.

  3. Finally, it iterates over the list of companies and prints each one's fields, e.g. rank, company name, social account links, CEO name, etc.

    import json
    import requests
    from urllib3.exceptions import InsecureRequestWarning

    # Suppress the warnings that verify=False triggers below.
    requests.packages.urllib3.disable_warnings(InsecureRequestWarning)

    def scrap_inc_5000():
        URL = 'https://www.inc.com/rest/companyprofile/nuleaf-naturals/withlist'

        response = requests.get(URL, verify=False)
        result = json.loads(response.text)  # Parse the result using json.loads
        extracted_data = result['fullList']['listCompanies']
        for data in extracted_data:
            print('-' * 100)
            print('Rank : ', data['rank'])
            print('Company : ', data['company'])
            print('Icon : ', data['icon'])
            print('CEO Name : ', data['ifc_ceo_name'])
            print('Facebook Address : ', data['ifc_facebook_address'])
            print('File Location : ', data['ifc_filelocation'])
            print('Linkedin Address : ', data['ifc_linkedin_address'])
            print('Twitter Handle : ', data['ifc_twitter_handle'])
            print('Secondary Link : ', data['secondary_link'])
            print('-' * 100)

    scrap_inc_5000()
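
Since the question mentions wanting all of each company's info rather than just two fields, a small variation of the same request can dump every field the API returns to a CSV file. This is only a sketch: the output file name inc5000.csv is hypothetical, and the column set is simply whatever keys the endpoint happens to return.

    import csv
    import requests

    URL = 'https://www.inc.com/rest/companyprofile/nuleaf-naturals/withlist'

    response = requests.get(URL, verify=False)
    companies = response.json()['fullList']['listCompanies']

    # Use the union of all keys as columns, since fields can vary per record;
    # DictWriter fills any missing values with an empty string.
    fieldnames = sorted({key for company in companies for key in company})

    with open('inc5000.csv', 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(companies)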
    

1 Comment

Thank you so much. It works! Although I was after the company's website, I see that that data is not available in the API JSON file, which is strange. Do you know why that happens even though the data is available on the webpage?
