
I'm trying to scrape data from a multi-page table that is returned after filling out a form. The URL of the original form in question is https://ndber.seai.ie/Pass/assessors/search.aspx

From https://kaijento.github.io/2017/05/04/web-scraping-requests-eventtarget-viewstate/ I adapted the code below, which extracts the hidden variables from the blank form; these are then sent with the POST request to get the data:

import requests
from bs4 import BeautifulSoup

url='https://ndber.seai.ie/PASS/Assessors/Search.aspx'

with requests.Session() as s:
    s.headers['user-agent'] = 'Mozilla/5.0'
    r    = s.get(url)
    soup = BeautifulSoup(r.content, 'html5lib')
    target = 'ctl00$DefaultContent$AssessorSearch$gridAssessors$grid_pager'

    # unsupported CSS Selector 'input[name^=ctl00][value]'
    data = { tag['name']: tag['value'] 
        for tag in soup.select('input[name^=ctl00]') if tag.get('value')
    }
    state = { tag['name']: tag['value'] 
        for tag in soup.select('input[name^=__]')
    }
    data.update(state)
    data['__EVENTTARGET'] = ''
    data['__EVENTARGUMENT'] = ''
    print(data)
    r = s.post(url, data=data)
    new_soup = BeautifulSoup(r.content, 'html5lib')
    print(new_soup)

The initial .get goes fine, I get the html for the blank form, and I can extract the parameters into data.
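For what it's worth, the two dict comprehensions behave as expected on a toy fragment (the form below is invented for illustration, not the real page source):

```python
from bs4 import BeautifulSoup

# Invented stand-in for the blank form, just to exercise the selectors.
html = """
<form>
  <input name="__VIEWSTATE" value="abc"/>
  <input name="__EVENTVALIDATION" value="def"/>
  <input name="ctl00$DefaultContent$AssessorSearch$dfSearch$Name" value=""/>
  <input name="ctl00$DefaultContent$AssessorSearch$dfSearch$Bottomsearch" value="Search"/>
</form>
"""
soup = BeautifulSoup(html, "html.parser")

# inputs whose name starts with ctl00 and that carry a non-empty value
data = {tag['name']: tag['value']
        for tag in soup.select('input[name^=ctl00]') if tag.get('value')}
# the hidden ASP.NET state fields (__VIEWSTATE etc.)
state = {tag['name']: tag['value'] for tag in soup.select('input[name^=__]')}
data.update(state)
print(sorted(data))
```

Note that the empty `Name` field is dropped by the `if tag.get('value')` filter, which is why `data` only carries fields that actually hold a value.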

However the .post returns a html page that indicates an error has occurred with no useful data.

Note that the results are split over multiple pages, and when you go from page to page the following parameters are given values:

data['__EVENTTARGET'] = 'ctl00$DefaultContent$AssessorSearch$gridAssessors$grid_pager' 
data['__EVENTARGUMENT'] = '1$n' # where n is the number of the page to retrieve

In the code above I'm initially just trying to get the first page of results and then once that's working I'll work out the loop to go through all the results and join them.
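As a sketch of that eventual loop, the per-page postback fields could come from a small helper (`PAGER_TARGET` mirrors the pager name above; the helper name `pager_fields` is my own):

```python
# Target name of the WebForms pager control, as seen in the page source.
PAGER_TARGET = 'ctl00$DefaultContent$AssessorSearch$gridAssessors$grid_pager'

def pager_fields(page):
    """Return the postback fields that request result page `page`."""
    return {
        '__EVENTTARGET': PAGER_TARGET,
        '__EVENTARGUMENT': f'1${page}',
    }

print(pager_fields(2))
```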

Does anyone have an idea of how to handle such a case?

Thanks / Colm

  • Having continued to dig I find stackoverflow.com/questions/12645849/… which then leads me to search for aspx specific web scraping with python and then I find kaijento.github.io/2017/05/04/… which I'm trawling/crawling through right now. Commented Nov 11, 2020 at 19:37
  • The comment above is no longer relevant, it was from the original post which I have now updated. Commented Mar 23, 2021 at 12:43
  • You need to use the value of ctl00_DefaultContent_AssessorSearch_captcha within the data parameters in order to send it with the post request to fetch the required content. Turns out the value of that key is dynamic, and I highly doubt you can find it in the page source. Commented Mar 23, 2021 at 12:46
  • Dang ! It does look as though it's doing that AJAX'y thing to get that value which changes at each reload of the page. It almost makes me want to do a copy/paste on the 27 pages of results every day just to spite them. #somuchforopendata Thanks for the pointer though @SIM Commented Mar 23, 2021 at 15:58
  • Took me ~10 minutes to copy paste it all, not sure if I can do this daily ! Risk of Repetitive Stress Injury. Commented Mar 23, 2021 at 16:27

1 Answer


You can get the tabular content across multiple pages from that website using the requests module. To do so, you have to send one POST request per page with the appropriate parameters.

Unlike the other parameters, there is one key, ctl00$DefaultContent$AssessorSearch$captcha, whose value is generated dynamically and is not present in the page source.

However, you can still fetch the value of that key using the requests_html library (FYI, requests and requests_html are by the same author). You only need to call get_captcha_value() once; the value it returns can then be reused for every subsequent request.

The script below currently fetches all the names from all the pages. You can modify the selector to get other fields of your interest.

This is how you can go:

import requests
from bs4 import BeautifulSoup
from requests_html import HTMLSession

link = 'https://ndber.seai.ie/Pass/assessors/search.aspx'

payload = {
    'ctl00$DefaultContent$AssessorSearch$dfSearch$Name': '',
    'ctl00$DefaultContent$AssessorSearch$dfSearch$CompanyName': '',
    'ctl00$DefaultContent$AssessorSearch$dfSearch$County': '',
    'ctl00$DefaultContent$AssessorSearch$dfSearch$searchType': 'rbnDomestic',
    'ctl00$DefaultContent$AssessorSearch$dfSearch$Bottomsearch': 'Search'
}

page = 1

def get_captcha_value():
    # the captcha value is generated client-side, so render the page in a
    # headless browser and read the value out of the resulting DOM
    with HTMLSession() as session:
        r = session.get(link)
        r.html.render(sleep=5)
        captcha_value = r.html.find("input[name$='$AssessorSearch$captcha']", first=True).attrs['value']
        return captcha_value

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36'
    r = s.get(link)
    soup = BeautifulSoup(r.text,"lxml")
    payload['__VIEWSTATE'] = soup.select_one("#__VIEWSTATE")['value']
    payload['__VIEWSTATEGENERATOR'] = soup.select_one("#__VIEWSTATEGENERATOR")['value']
    payload['__EVENTVALIDATION'] = soup.select_one("#__EVENTVALIDATION")['value']
    payload['ctl00$forgeryToken'] = soup.select_one("#ctl00_forgeryToken")['value']
    payload['ctl00$DefaultContent$AssessorSearch$captcha'] = get_captcha_value()
    
    while True:
        res = s.post(link, data=payload)
        soup = BeautifulSoup(res.text, "lxml")
        # stop once a page comes back with no result rows
        if not soup.select_one("table[id$='gridAssessors_gridview'] tr[class$='RowStyle']"): break
        for items in soup.select("table[id$='gridAssessors_gridview'] tr[class$='RowStyle']"):
            _name = items.select_one("td > span").get_text(strip=True)
            print(_name)

        # rebuild the payload from the returned form, then point the
        # WebForms pager at the next page of results
        page += 1
        payload = {i['name']: i.get('value', '') for i in soup.select('input[name]')}
        payload.pop('ctl00$DefaultContent$AssessorSearch$dfSearchAgain$Feedback')
        payload.pop('ctl00$DefaultContent$AssessorSearch$dfSearchAgain$Search')
        payload['__EVENTTARGET'] = 'ctl00$DefaultContent$AssessorSearch$gridAssessors$grid_pager'
        payload['__EVENTARGUMENT'] = f'1${page}'
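To grab every column rather than just the name, the same row selector can feed a list of cell texts. A self-contained sketch against a made-up fragment shaped like the assessors grid (the sample HTML, names, and id are invented):

```python
from bs4 import BeautifulSoup

# Invented fragment shaped like the assessors grid, so the selectors
# can be exercised without hitting the site.
sample = """
<table id="ctl00_DefaultContent_AssessorSearch_gridAssessors_gridview">
  <tr class="HeaderStyle"><th>Name</th><th>Company</th><th>County</th></tr>
  <tr class="RowStyle"><td><span>Jane Doe</span></td><td>Acme BER Ltd</td><td>Dublin</td></tr>
  <tr class="AltRowStyle"><td><span>John Smith</span></td><td>Smith Energy</td><td>Cork</td></tr>
</table>
"""
soup = BeautifulSoup(sample, "html.parser")
# tr[class$='RowStyle'] matches both RowStyle and AltRowStyle rows,
# but not the HeaderStyle row
rows = [
    [td.get_text(strip=True) for td in tr.select("td")]
    for tr in soup.select("table[id$='gridAssessors_gridview'] tr[class$='RowStyle']")
]
print(rows)
```

From there a csv.writer could persist the rows page by page.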

6 Comments

Oh, this looks good. Give me a chance to try it out. Thanks @SIM
It gives an error message "RuntimeError: Cannot use HTMLSession within an existing event loop. Use AsyncHTMLSession instead." originating from the r.html.render(sleep=5) line. I'll give it a go.
BTW, I'm running the code on kaggle.com; I wonder if that context makes a difference?
I don't know why you encountered that error at your end while executing the script. I thought I'd give you a video demo anyway, as a proof of concept of how it performs on my end.
OK, I'll test it on my local machine, but it sure looks good from the video.
