Scraping a webpage that requires inputs and recaptcha in Python

Question

I'm trying to scrape a website that provides individual access to court cases in New Jersey county courts. I'm having a lot of trouble figuring out how to start though. I've scraped quite a few websites before but I've usually been able to start by adapting the URL to pass through the search parameters. However, when I access this data the URL does not change so I'm at a bit of a loss.

Additionally, there is a test for me to prove that I am not a robot (which occasionally turns into a reCAPTCHA).

On the website linked above, say, for example, the inputs would be:

Case County==Bergen, Docket Type==Landlord Tenant (LT), Docket Number==000001, and Docket Year==19.

I would then like to be able to extract the Defendant Name or anything from the subsequent page.

Does anyone have any advice on how I should proceed with this?

Thanks in advance

pbuck · Accepted Answer · 2019-11-05 16:01:27Z


Websites that "require input" can be scraped using Selenium, which drives a real browser so that the page's JavaScript runs: your Python code then operates the page much as a user would (click here, type there). It's slow.
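A minimal sketch of the Selenium approach, using the example inputs from the question. The URL, the element IDs (`county`, `docketType`, `docketNumber`, `docketYear`, `search`, `defendantName`) and the choice of Firefox are all assumptions for illustration; the real ones have to be read from the page source. It does not attempt to get past the robot check.

```python
# Hypothetical values: the real URL and element IDs must be read from the page.
SEARCH_URL = "https://example.invalid/case-search"
FORM_FIELDS = {
    "county": "Bergen",
    "docketType": "Landlord Tenant (LT)",
    "docketNumber": "000001",
    "docketYear": "19",
}


def search_case(url, fields, submit_id="search", result_id="defendantName", timeout=10):
    """Fill in the search form in a real browser and return one result element's text."""
    # Imported here so this module still loads where Selenium is not installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import Select, WebDriverWait

    driver = webdriver.Firefox()  # assumes Firefox is available locally
    try:
        driver.get(url)
        for element_id, value in fields.items():
            element = driver.find_element(By.ID, element_id)
            if element.tag_name == "select":
                Select(element).select_by_visible_text(value)  # drop-down
            else:
                element.send_keys(value)  # text box
        driver.find_element(By.ID, submit_id).click()
        # Wait for the results page to render before reading from it.
        result = WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.ID, result_id))
        )
        return result.text
    finally:
        driver.quit()
```

Calling `search_case(SEARCH_URL, FORM_FIELDS)` would open a browser window, fill in the form and return the text of the (hypothetical) `defendantName` element.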

Alternatively, if you look at the page details (for example in your browser's developer tools), you may see what happens to the input and simply send the resulting GET or POST request yourself, properly formed. For example, forms will often do a POST with the parameters: look at the code, figure out which parameters get posted and to what URL, and then send that POST from Python. You'll probably need a cookiejar to maintain session info.
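The POST-plus-cookiejar idea can be sketched with the standard library alone. The URL and the parameter names (`county`, `docketType`, `docketNumber`, `docketYear`) are made up for illustration; the real ones are whatever the form actually submits. Nothing here deals with the robot check.

```python
import http.cookiejar
import urllib.parse
import urllib.request

# Hypothetical endpoint and field names; read the real ones from the form.
SEARCH_URL = "https://example.invalid/case-search"
SEARCH_PARAMS = {
    "county": "Bergen",
    "docketType": "LT",
    "docketNumber": "000001",
    "docketYear": "19",
}


def build_session():
    """Return an opener that stores and resends cookies, plus its cookie jar."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def build_search_request(url, params):
    """Encode the parameters as a form body and wrap them in a POST request."""
    body = urllib.parse.urlencode(params).encode("ascii")
    return urllib.request.Request(url, data=body, method="POST")


def submit(opener, request, timeout=30):
    """Send the request through the cookie-aware opener and return the HTML."""
    with opener.open(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


opener, jar = build_session()
request = build_search_request(SEARCH_URL, SEARCH_PARAMS)
```

In practice you would first fetch the search page with the same `opener` so that any session cookie lands in `jar`, then call `submit(opener, request)` and parse the returned HTML.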

However, as a website maintainer, my advice to you is not to attempt to scrape this site: it doesn't want to be scraped, and repeated attempts only escalate defensive measures on the part of the website owner. You may also be violating the site's usage policy, or state and/or federal laws.

Instead, look for an alternative API, or alternative source. (NJ Courts may have an alternative API, designed for computer usage: send them an email!)


1 Comment

C.Robin Over a year ago

Hi pbuck, Thanks for this advice, both technical and strategic. I'm going to reach out to NJ and see if they have an API as a result, and agree that it's likely not wise to try and scrape the site.
