0

I'm scraping this website using Python and Selenium, but it currently only scrapes the first 10 pages for the month of July. It converts the page number in the previous sibling of the next button to an int and clicks next number_of_pages - 1 times; however, after it gets to page 10 it stops.

URL - https://planning.adur-worthing.gov.uk/online-applications/search.do?action=monthlyList

Can anyone help me to get it to scrape all the pages?

def pagination( driver ):
    data = []
    last_element = driver.find_element_by_xpath('//a[ contains( concat( " ", normalize-space( @class ), " "), " next ") ]/preceding-sibling::a[1]')
    if last_element is None:
        number_of_pages = 1
    else:
        number_of_pages = int( last_element.text )
    # data = [ getData( driver ) ]
    data.extend(getData(driver))
    for i in range(number_of_pages - 1):
        driver.find_element_by_xpath('//a[ contains( concat( " ", normalize-space( @class ), " "), " next ") ]').click()
        data.extend( getData( driver ) )
        time.sleep(1)
    return data
4
  • Can you print number_of_pages before the for loop? I suspect that because you convert the text of the last element to int, it just shows 10 (even though there are more pages). Commented Aug 23, 2018 at 14:11
  • I just tested this out and you're right: it only converts 10 to int and doesn't carry on for the other pages. Commented Aug 23, 2018 at 14:13
  • As per your given link [URL - planning.adur-worthing.gov.uk/online-applications/…], I am seeing only 10 pages. Commented Aug 23, 2018 at 14:18
  • Are you checking the month of July? If you are, press page 10 and more should come up. Commented Aug 23, 2018 at 14:23

3 Answers

1

number_of_pages seems to have the value of 10.

Find another way to find out how many pages there are.

You can use a while loop that checks whether the "next page" button is available: if it is, keep going; otherwise you are on the last page.

Like this:

while next_button_element.is_displayed():
    # Do the action that is currently in the for loop

4 Comments

Do you mean like this: next_button_element = driver.find_element_by_xpath('//a[ contains( concat( " ", normalize-space( @class ), " "), " next ") ]') while next_button_element.is_displayed(): driver.find_element_by_xpath('//a[ contains( concat( " ", normalize-space( class ), " "), " next ") ]').click() data.extend( getData( driver ) ) time.sleep(1) return data
Use simpler selectors. The CSS selector for the next button: driver.find_elements_by_css_selector('a.next')
No need to find the element twice. Find it once, store it in a variable, and then call is_displayed() or click() on it.
This doesn't work either: next_button_element = driver.find_elements_by_css_selector('a.next') while next_button_element.is_displayed(): next_button_element.click() data.extend( getData( driver ) ) time.sleep(1) return data
1

Code you can use:

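from selenium.common.exceptions import NoSuchElementException

while True:
    data.extend(getData(driver))
    try:
        driver.find_element_by_css_selector('a.next').click()
    except NoSuchElementException:
        # The last page has no "next" link, so stop paging.
        break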
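
An alternative that does not rely on an exception to detect the last page is to look the link up with find_elements (plural), which returns an empty list when nothing matches. This is only a minimal sketch, assuming the next link matches a.next as above. The names scrape_all_pages, get_data, and pause are hypothetical, and the string "css selector" is the value of By.CSS_SELECTOR.

```python
import time


def scrape_all_pages(driver, get_data, next_selector="a.next", pause=1.0):
    """Collect rows from every page by following the "next" link until it is gone."""
    data = []
    while True:
        data.extend(get_data(driver))
        # find_elements returns [] instead of raising NoSuchElementException
        # when the current page has no "next" link.
        next_links = driver.find_elements("css selector", next_selector)
        if not next_links:
            break
        next_links[0].click()
        time.sleep(pause)  # crude wait for the next page to load
    return data
```

With a real driver this would be called as data = scrape_all_pages(driver, getData).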

6 Comments

I got this error: next_button_by = (By.CSS_SELECTOR, "a.next") NameError: global name 'By' is not defined
Add from selenium.webdriver.common.by import By
The line if driver.find_elements(next_button_by)==0: gave the error: WebDriverException: Message: invalid argument: 'using' must be a string
I missed getting the count: use .count or len(driver.find_elements(next_button_by))
It works, thank you. However, how do I get it to stop printing this error when a.next doesn't exist? NoSuchElementException: Message: no such element: Unable to locate element: {"method":"css selector","selector":"a.next"}
0

Look, I understand you took the idea of calculating the total number of pages from my answer to a previous question of yours. In that case it worked because the last page number was directly available to us, but that's not the case here.

Solution:

Although the number of pages is not directly available, the total number of entries is:

Image displaying the total number of entries

Now, as you can see in the above screenshot, for July this number is 174. Assuming you leave the pagination length (the number of entries on a single page) at the default of 10, the number of pages should be 18 (17 pages of 10 entries each and one extra page for the remaining 4 entries).

So, the logic for calculating the number of pages is simple. If you have the total number of entries in a total_entries variable, the number of pages should be:

number_of_pages = (total_entries // 10) + 1

In Python 2 the division operator on two ints returns the floored integer (in Python 3, // does the same), so 174 // 10 returns 17 and adding 1 gives 18. So there you have it: 18 as the number of pages. Note that this formula gives one page too many when the total is an exact multiple of 10 (170 entries would give 18 instead of 17).
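
The same page count can also be written as a ceiling division, which behaves the same in Python 2 and 3 and does not overcount when the total is an exact multiple of the page size. A minimal sketch, not part of the original answer: pages_for and per_page are hypothetical names, and returning 1 for an empty result set is an assumption.

```python
def pages_for(total_entries, per_page=10):
    """Number of result pages needed to show total_entries rows."""
    if total_entries <= 0:
        # Assumption: an empty result set still renders a single page.
        return 1
    # Integer ceiling division: round up without going through floats.
    return (total_entries + per_page - 1) // per_page
```

For the July case, pages_for(174) gives 18, so the next button has to be clicked 17 times.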

Now, to extract the total number of entries, use the locator below to find the <span> element holding it.

driver.find_element_by_xpath("//span[@class='showing']")

But this element contains text like "Showing 1-10 of 174", and you need only the 174 part of the string. To get it, first extract the string after "of" and then convert it to an int.

Algorithm to extract the total number of entries as int from the text:

import re

showing_text = driver.find_element_by_xpath("//span[@class='showing']").text    #Showing 1-10 of 174
number_of_entries_text = showing_text.split("of",1)[1]        # 174 as text
number_of_entries = int( re.findall(r'\d+',number_of_entries_text)[0])  #174 as int
number_of_pages = (number_of_entries // 10) + 1   #18
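
The parsing step above can also be wrapped in a small helper that can be checked without a browser. This is a rough sketch that assumes the text always ends in "of <number>", as in "Showing 1-10 of 174". The name parse_total_entries is hypothetical, and the handling of thousands separators is an assumption about how larger totals might be displayed.

```python
import re


def parse_total_entries(showing_text):
    """Return the total from text such as 'Showing 1-10 of 174'."""
    match = re.search(r"of\s+([\d,]+)\s*$", showing_text.strip())
    if match is None:
        raise ValueError("unexpected pager text: %r" % showing_text)
    # Assumption: large totals may be shown with thousands separators.
    return int(match.group(1).replace(",", ""))
```

It would be fed the .text of the span located by "//span[@class='showing']".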

Final code:

import re
import time

def pagination( driver ):
   data = []
   # find_elements returns an empty list (instead of raising) when nothing matches
   showing_elements = driver.find_elements_by_xpath("//span[@class='showing']")
   if not showing_elements:
      number_of_pages = 1
   else:
      showing_text = showing_elements[0].text              # Showing 1-10 of 174
      number_of_entries_text = showing_text.split("of",1)[1]
      number_of_entries = int( re.findall(r'\d+',number_of_entries_text)[0])
      number_of_pages = (number_of_entries // 10) + 1

   data.extend( getData( driver ) )   # scrape the first page
   for i in range(number_of_pages - 1):
       driver.find_element_by_xpath('//a[ contains( concat( " ", normalize-space( @class ), " "), " next ") ]').click()
       time.sleep(1)
       data.extend( getData( driver ) )
   return data

Note:

I think my solution is better since you don't have to repeatedly check for any element to be available or to catch any exceptions. You just directly get the number of pages and you click the next button that many times.

3 Comments

math.ceil rounds it down to the smallest integer, so that means it would skip page 18
If there is a way to get it to also go to page 18, that would be great
@AbdulJamac I am sorry, I made it more complicated than necessary. In Python 2 the division operator on two ints returns the floored int, so there is no need for math.ceil. Check my edited answer. Just dividing the total number of entries by 10 and adding 1 will do the trick. And yes, that way it will go all the way to the end, i.e. 18.
