I want to scrape a website that uses JavaScript/AJAX to load additional results as you scroll down the page. I am using Python 3.7 with Selenium driving headless Chrome. However, as scraping progresses, the page's HTML keeps growing, which slows my machine down until it grinds to a halt. Even simple operations like –

code = driver.page_source

– grow to take several seconds. I ran a test to see how much the page source had grown: after a few hundred results it had expanded from about half a million characters to 25 million characters, a 50-fold increase. My questions are:

1) Is there some way to have Selenium delete prior code (similar to the way you can delete it in Chrome's "inspect element" mode) to keep the size manageable?

2) Or is there some other simple solution that I'm overlooking?

  • Do you need to use Selenium at all? If you can send the same (or similar) requests that the JavaScript on the page sends, you can skip all the DOM processing, which should be orders of magnitude faster. Commented Oct 8, 2018 at 0:47
  • Do you know if there is a tutorial that explains how these processes work? I'm proficient in Python, but just starting to learn JavaScript. Commented Oct 8, 2018 at 1:00
  • Almost certainly, there is no need to do anything with JavaScript at all. Simply open the developer tools of your favorite web browser (pressing F12 works in most browsers), go to the Network tab, and look at the requests that are sent while you use the website. Commented Oct 8, 2018 at 1:38

1 Answer

One suggestion would be to look at the JavaScript the page runs and do the equivalent in Python, rather than relying on Selenium alone.

I don't know which website you're scraping, but it sounds like it's making a series of AJAX calls, loading page after page of results (images, posts, whatever).

Reverse engineer the JS. It's probably making the same AJAX call over and over, passing in a parameter or two. Figure out how the JS calculates that parameter (is it a timestamp, the ID of the "last" element received, etc.).

Then, rather than having Selenium do the work, use Python's requests library to send the equivalent POST. Retrieve the data (likely JSON or HTML), parse it for what you need, and repeat.

Depending on the site you're looking at, this can be orders of magnitude faster.
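
As a rough sketch of that loop, assuming the site answers with JSON: the URL, the `last_id` parameter, and the `{"items": [...], "next": ...}` response shape below are all made up for illustration, so substitute whatever the Network tab shows for the real site. The function names `fetch_page` and `iter_results` are likewise hypothetical.

```python
from typing import Callable, Iterator, Optional

# Hypothetical endpoint; replace with the request URL seen in the Network tab.
API_URL = "https://example.com/api/results"


def fetch_page(cursor: Optional[str]) -> dict:
    """Send one POST, the way the page's own JavaScript presumably does."""
    import requests  # third-party; imported here so iter_results works without it

    # "last_id" is an assumed parameter name (ID of the last element received).
    payload = {} if cursor is None else {"last_id": cursor}
    response = requests.post(API_URL, data=payload, timeout=10)
    response.raise_for_status()
    return response.json()


def iter_results(
    fetch: Callable[[Optional[str]], dict], max_pages: int = 100
) -> Iterator[dict]:
    """Yield items page by page until no further cursor is returned."""
    cursor = None
    for _ in range(max_pages):  # hard cap so a bad cursor cannot loop forever
        page = fetch(cursor)
        yield from page.get("items", [])
        cursor = page.get("next")  # assumed field carrying the next cursor
        if not cursor:
            break
```

You would then iterate with `for item in iter_results(fetch_page): ...`. Because each response is parsed and discarded, memory use stays flat instead of growing with the page the way `driver.page_source` does.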
