
I am building a scraping application for Hants.gov.uk, and right now I am working on getting it to click through the result pages rather than scrape them. When it reached the last row on page 1 it simply stopped, so I made it click the "Next Page" button, but first it has to go back to the original URL. It clicks through to page 2, but after page 2 is processed it doesn't move on to page 3; it just restarts page 2.

Can somebody help me fix this issue?

Code:

import time
import config # Don't worry about this. This is an external file to make a DB
import urllib.request
from bs4 import BeautifulSoup
from selenium import webdriver

url = "https://planning.hants.gov.uk/SearchResults.aspx?RecentDecisions=True"

driver = webdriver.Chrome(executable_path=r"C:\Users\Goten\Desktop\chromedriver.exe")
driver.get(url)

driver.find_element_by_id("mainContentPlaceHolder_btnAccept").click()

def start():
    elements = driver.find_elements_by_css_selector(".searchResult a")
    links = [link.get_attribute("href") for link in elements]

    result = []
    for link in links:
        if link not in result:
            result.append(link)
        else:
            driver.get(link)
            goUrl = urllib.request.urlopen(link)
            soup = BeautifulSoup(goUrl.read(), "html.parser")
            #table = soup.find_element_by_id("table", {"class": "applicationDetails"})
            for i in range(20):
                pass # Don't worry about all this commented code, it isn't relevant right now
                #table = soup.find_element_by_id("table", {"class": "applicationDetails"})
                #print(table.text)
            #   div = soup.select("div.applicationDetails")
            #   getDiv = div[i].split(":")[1].get_text()
            #   log = open("log.txt", "a")
            #   log.write(getDiv + "\n")
            #log.write("\n")

start()
driver.get(url)

for i in range(5):
    driver.find_element_by_id("ctl00_mainContentPlaceHolder_lvResults_bottomPager_ctl02_NextButton").click()
    url = driver.current_url
    start()
    driver.get(url)
driver.close()

3 Answers


try this:

import time
# import config # Don't worry about this. This is an external file to make a DB
import urllib.request
from bs4 import BeautifulSoup
from selenium import webdriver

url = "https://planning.hants.gov.uk/SearchResults.aspx?RecentDecisions=True"

driver = webdriver.Chrome()
driver.get(url)

driver.find_element_by_id("mainContentPlaceHolder_btnAccept").click()

result = []


def start():
    elements = driver.find_elements_by_css_selector(".searchResult a")
    links = [link.get_attribute("href") for link in elements]
    result.extend(links)

def start2():
    for link in result:
        # if link not in result:
        #     result.append(link)
        # else:
            driver.get(link)
            goUrl = urllib.request.urlopen(link)
            soup = BeautifulSoup(goUrl.read(), "html.parser")
            #table = soup.find_element_by_id("table", {"class": "applicationDetails"})
            for i in range(20):
                pass # Don't worry about all this commented code, it isn't relevant right now
                #table = soup.find_element_by_id("table", {"class": "applicationDetails"})
                #print(table.text)
            #   div = soup.select("div.applicationDetails")
            #   getDiv = div[i].split(":")[1].get_text()
            #   log = open("log.txt", "a")
            #   log.write(getDiv + "\n")
            #log.write("\n")


while True:
    start()
    element = driver.find_element_by_class_name('rdpPageNext')
    try:
        # On the last page the Next arrow's onclick becomes "return false;",
        # which is the signal to stop paginating.
        check = element.get_attribute('onclick')
        if check != "return false;":
            element.click()
        else:
            break

    except:
        break

print(result)
start2()
driver.get(url)

9 Comments

Yeah, but the code is also required to check through each application too; there are 7 on each page.
It is checking; I used a while loop.
Use sleep() in between loop iterations, as per your requirement. I can't run the code right now, but I think this will work fine.
I thought your problem was getting through each page, so I solved only that. You have to add your other code to get the data from the table; you can add it just after the line while True:, as sketched below.
Tell me whether this is working or not; if it is, I will explain the logic.
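
To illustrate the suggestion in those comments, here is a minimal sketch of how the per-page scraping could sit inside the while True: loop from this answer. It assumes the driver and imports set up in the code above, and scrape_current_page() is a hypothetical stand-in for whatever table-parsing code ends up being used:

def scrape_current_page():
    # Hypothetical placeholder for the asker's own table-parsing code,
    # e.g. collecting the .searchResult links on the currently loaded
    # results page and reading each application's details table.
    pass

while True:
    scrape_current_page()   # scrape the current results page first
    time.sleep(2)           # brief pause between pages, as suggested above
    element = driver.find_element_by_class_name('rdpPageNext')
    try:
        # On the last page the Next arrow's onclick is "return false;",
        # which is the signal to stop paginating.
        if element.get_attribute('onclick') == "return false;":
            break
        element.click()
    except Exception:
        break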

As per the URL https://planning.hants.gov.uk/SearchResults.aspx?RecentDecisions=True, to click through all the pages you can use the following solution:

  • Code Block:

    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    
    options = Options()
    options.add_argument("start-maximized")
    options.add_argument("disable-infobars")
    options.add_argument("--disable-extensions")
    driver = webdriver.Chrome(chrome_options=options, executable_path=r'C:\Utility\BrowserDrivers\chromedriver.exe')
    driver.get('https://planning.hants.gov.uk/SearchResults.aspx?RecentDecisions=True')
    WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.ID, "mainContentPlaceHolder_btnAccept"))).click()
    numLinks = len(WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "div#ctl00_mainContentPlaceHolder_lvResults_topPager div.rdpWrap.rdpNumPart>a"))))
    print(numLinks)
    for i in range(numLinks):
        print("Perform your scrapping here on page {}".format(str(i+1)))
        WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//div[@id='ctl00_mainContentPlaceHolder_lvResults_topPager']//div[@class='rdpWrap rdpNumPart']//a[@class='rdpCurrentPage']/span//following::span[1]"))).click()
    driver.quit()
    
  • Console Output:

    8
    Perform your scraping here on page 1
    Perform your scraping here on page 2
    Perform your scraping here on page 3
    Perform your scraping here on page 4
    Perform your scraping here on page 5
    Perform your scraping here on page 6
    Perform your scraping here on page 7
    Perform your scraping here on page 8
    

3 Comments

Although this is a splendid idea, I would like to accomplish this task with mostly my own code; I am just trying to figure it out :) Thank you though
@FeitanPortor We are aware of neither your requirement nor your use case. You have raised your question and contributors are trying to help you out in their own capacity. Feel free to use either the code or the logic within :) it is your choice.
I know. This doesn't precisely answer my question. I upvoted earlier

Hi @Feitan Portor, you have written the code absolutely fine. The only reason you are redirected back to the first page is that you set url = driver.current_url in the last for loop: the URL stays static, and it is only the JavaScript that triggers the Next click event. So just remove url = driver.current_url and driver.get(url) and you are good to go; I have tested this myself.

Also, to see which page your scraper is currently on, just add this part in the for loop:

ss = driver.find_element_by_class_name('rdpCurrentPage').text
print(ss)

Hope this solves your confusion.
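
For reference, a minimal sketch of what the question's final loop would look like with those two lines removed, as this answer suggests (the element ID and range(5) are taken straight from the question's code, and the page-counter snippet above is dropped into the loop):

for i in range(5):
    driver.find_element_by_id("ctl00_mainContentPlaceHolder_lvResults_bottomPager_ctl02_NextButton").click()
    ss = driver.find_element_by_class_name('rdpCurrentPage').text   # report which page the scraper is on
    print(ss)
    start()   # scrape the page that the Next click just loaded
driver.close()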

1 Comment

I get an error (pastebin.com/jZPCpdjB). It manages to get to page 2, but no further.
