
I am new to web scraping; I am trying to scrape information about water utilities from this site. I can successfully navigate through each region via the drop-down and access the first page of results, but I cannot navigate through the remaining pages of a region before moving on to the next one. The page navigation bar is a list with no 'Next' button, so I try to iterate through it with range; however, taking the len of the list does not give me the correct number of pages. As it stands, I only ever reach the first page of each region. I am struggling to figure out what I am doing wrong, even after looking through similar questions. Any help will be highly appreciated.

Thanks!

Here is my current code (no scraping yet; it focuses on navigating the pages):

import time
import pandas as pd
from selenium import webdriver
from selenium.webdriver.support.ui import Select, WebDriverWait
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException, WebDriverException

url = 'https://database.ib-net.org/search_utilities?type=2'
browser = webdriver.Firefox()
browser.get(url)
time.sleep(3)
print("Retriving the site...")

# All regions available
regions = ['Africa', 'East Asia and Pacific', 'Europe and Central Asia', 'Latin America (including USA and Canada)', 'Middle East and Northern Africa', 'South Asia']


for region in regions:
    # Select the region from the drop-down menu
    selectOption = Select(browser.find_element_by_id('MainContent_ddRegion'))

    print("Now constructing output for: " + region)

    # Select the region and wait for the table to populate
    selectOption.select_by_visible_text(region)

    time.sleep(4)

    list_of_table_pages = browser.find_element_by_xpath('//*[@id="MainContent_gvUtilities"]/tbody/tr[52]/td/ul')
    no_pages = len(list_of_table_pages.find_elements_by_xpath("//li"))

    print("No of table pages to be scraped: %d" % no_pages)

    print("Outputting data into " + region + ".csv...")

    all_table_data = []

    # start the range count from 1 instead of 0
    for page in range(1, no_pages):
        try:
            # Navigate to the next page once done
            table_page = str(page)
            WebDriverWait(browser, 20).until(EC.visibility_of_element_located((By.XPATH, '//*[@id="MainContent_gvUtilities"]/tbody/tr[52]/td/ul/li[' + table_page + ']/a'))).click()
            print("Navigating to next table page...")

        except (TimeoutException, WebDriverException):
            print("Last page reached, moving to the next region...")
            break

    print("No more pages to scrape under %s. Moving to the next region..." % region)

browser.quit()

1 Answer

The following calculates the number of pages from the displayed result count and the known maximum number of results per page (for example, 272 results at 50 per page gives ceil(272/50) = 6 pages).

It then loops through the pages, clicking the pager link whose href contains the page number. Where that number is not yet visible in the pagination bar, the resulting exception is handled and the pagination ellipsis is clicked to reveal the next block of page links.

For pages greater than 1, I print the first td of the first result row to show that the page has actually been visited. I also swap out the hard-coded sleeps for explicit wait conditions.
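As an aside, the page count in your original code is most likely off because an XPath that starts with // searches the whole document even when find_elements is called on an element, so it counts every li on the page. A context-relative query needs a leading dot:

# Starts from the document root, counting every <li> on the page:
no_pages = len(list_of_table_pages.find_elements_by_xpath("//li"))

# The leading "." restricts the search to the pager element's own subtree:
no_pages = len(list_of_table_pages.find_elements_by_xpath(".//li"))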

I have used ChromeDriver.

This is to give you a framework to use. I tested it and it ran for all region selections and pages.
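One portability note: the find_element_by_* helpers used in both snippets were deprecated in Selenium 4 and have since been removed, so on a current driver you would write the same lookups with By locators, e.g.:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import Select

# Equivalent lookups on Selenium 4+:
selectOption = Select(browser.find_element(By.ID, 'MainContent_ddRegion'))
cells = browser.find_elements(By.CSS_SELECTOR, '#MainContent_gvUtilities tr > td')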


import math

from selenium import webdriver
from selenium.webdriver.support.ui import Select, WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import NoSuchElementException

results_per_page = 50  # the site shows at most 50 results per page
url = 'https://database.ib-net.org/search_utilities?type=2'
browser = webdriver.Chrome()  # or webdriver.Firefox()
browser.get(url)
print("Retrieving the site...")

# All regions available
regions = ['Africa', 'East Asia and Pacific', 'Europe and Central Asia', 'Latin America (including USA and Canada)', 'Middle East and Northern Africa', 'South Asia']

for region in regions:
    # Select the region from the drop-down menu
    selectOption = Select(browser.find_element_by_id('MainContent_ddRegion'))

    print("Now constructing output for: " + region)

    # Select the region, then wait for the table to populate
    selectOption.select_by_visible_text(region)

    WebDriverWait(browser, 5).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, '#MainContent_gvUtilities tr > td')))

    # Number of pages = result count / max results per page, rounded up
    num_results = int(browser.find_element_by_id('MainContent_lblqResults').text)
    num_pages = math.ceil(num_results / results_per_page)
    print(f'pages to scrape are: {num_pages}')

    for page in range(2, num_pages + 1):
        print(f'visiting page {page}')
        try:
            # Click the pager link whose postback href contains this page number
            browser.find_element_by_css_selector(f'.pagination > li > [href*="Page\${page}"]').click()
            WebDriverWait(browser, 5).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, '#MainContent_gvUtilities tr > td')))
            print(browser.find_element_by_css_selector('#MainContent_gvUtilities tr:nth-child(2) span').text)
        except NoSuchElementException:
            # Page number not visible yet: click the pagination ellipsis
            # to reveal the next block of page links
            browser.find_element_by_css_selector('.pagination > li > a').click()
        except Exception as e:
            print(e)
            continue
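Since the end goal is a CSV per region, one way to hook the scraping back in is to parse the rendered grid with pandas on each visited page. This is only a sketch under the assumption that the grid parses cleanly; the table id comes from the selectors above, and the trailing pager row would need to be dropped:

import pandas as pd

# Parse the rendered results grid out of the current page's HTML.
# read_html needs lxml or html5lib installed and returns a list of
# DataFrames; attrs narrows the match to the utilities grid.
def grab_table(browser):
    table = pd.read_html(browser.page_source, attrs={'id': 'MainContent_gvUtilities'})[0]
    return table.iloc[:-1]  # drop the trailing pager row

Inside the page loop you could then accumulate the frames and, once a region is done, write pd.concat(frames).to_csv(region + '.csv', index=False).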
            