JavaScript is bogging down Selenium for Python

Question

I want to scrape a website that uses JavaScript/AJAX to load additional results as you scroll down the page. I am using Python 3.7 with Selenium and headless Chrome. However, as scraping progresses, the page accumulates an ever-expanding amount of markup, which slows my machine until it grinds to a standstill. Even simple operations like –

code = driver.page_source

– grow to take several seconds. I ran a test to see how much the page source had grown: after a few hundred results it had expanded from about half a million characters to 25 million characters, a 50-fold increase. My questions are:

1) Is there some way to have Selenium delete the previously loaded page content (similar to the way you can delete it in Chrome's "inspect element" mode) to keep the size manageable?

2) Or is there some other simple solution that I'm overlooking?

Do you need to use Selenium at all? If you can send the same (or similar) requests that the JavaScript on the page sends, you can skip all the DOM processing, which should be orders of magnitude faster.
– phihag, Commented Oct 8, 2018 at 0:47
Do you know if there is a tutorial that explains how these processes work? I'm proficient in Python, but just starting to learn JavaScript.
– Alex Heebs, Commented Oct 8, 2018 at 1:00
Almost certainly, there is no need to do anything with JavaScript at all. Simply open the developer tools of your favorite web browser (pressing F12 does it in most browsers), go to the Network tab, and look at the requests that are sent while you use the website.
– phihag, Commented Oct 8, 2018 at 1:38

pbuck · Accepted Answer · 2018-10-08 00:51:41Z

One suggestion would be to look at the JavaScript the page runs and do the equivalent in Python, rather than relying on Selenium alone.

I don't know which website you're scraping, but it sounds like it's making a series of AJAX calls, loading page after page of results (images, posts, whatever).

Reverse engineer the JS: it's probably making the same AJAX call over and over, passing in a parameter or two. Figure out how the JS calculates the passed-in parameter (is it a timestamp, the ID of the "last" element received, etc.).

Then, rather than having Selenium do the work, use the Python requests library to make the equivalent POST. Retrieve the data (likely JSON or HTML), parse it for what you need, and then repeat.
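As a rough sketch of that loop, assuming the site pages by the ID of the last item received: the URL, the "after" parameter, and the "items"/"id" field names below are all made up for illustration, so substitute whatever the Network tab actually shows for your site.

```python
from typing import Callable, Dict, Iterator, Optional

# Hypothetical endpoint; read the real one off the browser's Network tab.
RESULTS_URL = "https://example.com/api/results"


def fetch_page(cursor: Optional[str]) -> Dict:
    """POST once to the (assumed) AJAX endpoint and return the decoded JSON."""
    import requests  # third-party: pip install requests

    # "after" is an assumed parameter name for the last-seen item ID.
    payload = {"after": cursor} if cursor is not None else {}
    response = requests.post(RESULTS_URL, data=payload, timeout=10)
    response.raise_for_status()
    return response.json()


def iter_results(fetch: Callable[[Optional[str]], Dict] = fetch_page,
                 max_pages: int = 1000) -> Iterator[Dict]:
    """Yield items page by page, feeding each page's last ID back as the cursor."""
    cursor = None
    for _ in range(max_pages):
        page = fetch(cursor)
        items = page.get("items", [])  # assumed response shape
        if not items:
            break  # an empty page is taken to mean "no more results"
        yield from items
        cursor = items[-1]["id"]  # assumes the server pages by last-seen ID
```

Iterating with `for item in iter_results(): ...` handles one item at a time, so nothing accumulates the way the DOM does in the browser. If the endpoint returns HTML fragments instead of JSON, the same loop applies with an HTML parser in place of `response.json()`.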

Depending on the site you're looking at, this can be orders of magnitude faster.
