
I'm writing a web scraping script in Ruby that opens a used car website, searches for a make/model of car, loops over the pages of results, and then scrapes the data on each page.

The problem I'm having is that I don't necessarily know the max # of pages at the beginning, and only as I iterate closer to the last few known pages does the pagination increase and reveal more pages.

I've defined cleanpages as an array and populated it with what I know are the available pages when first opening the site. Then I use cleanpages.each do to iterate over those "pages". Each time I'm on a new page I add all known pages back into cleanpages and then run cleanpages.uniq to remove duplicates. The problem seems to be that cleanpages.each do only iterates as many times as its original length.

Can I make it so that within the each do loop, I increase the number of times it will iterate?

Do you have any code you can add to your question that you've tried already? See stackoverflow.com/help/how-to-ask Commented Nov 9, 2019 at 0:04
  • What is the code you are having trouble with? What trouble do you have with your code? Do you get an error message? What is the error message? Is the result you are getting not the result you are expecting? What result do you expect and why, what is the result you are getting and how do the two differ? Is the behavior you are observing not the desired behavior? What is the desired behavior and why, what is the observed behavior, and in what way do they differ? Please, provide a minimal reproducible example. Commented Nov 9, 2019 at 8:30

1 Answer


Rather than using Array#each, try using your array as a queue. The general idea is:

queue = initial_pages
while queue.any?
  page = queue.shift                              # take the next page off the front
  new_pages = process(page)                       # scrape it; returns any pages it links to
  queue.concat(get_unprocessed_pages(new_pages))  # append only pages not yet handled
end

The idea here is that you just keep taking items from the head of your queue until it's empty. You can push new items into the end of the queue during processing and they'll be processed correctly.

You'll want to be sure to remove pages from new_pages which are already in the queue or were already processed.
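As a runnable sketch of that idea, here is a minimal version that tracks seen pages with a Set. The `DISCOVERED` hash simulates pagination revealing more pages as you go; in the real script those links would come from scraping each results page, and all the names here are illustrative:

```ruby
require "set"

# Simulated pagination: visiting a page reveals links to other pages.
# In the real script these would be scraped from the results page.
DISCOVERED = {
  1 => [1, 2, 3],
  2 => [1, 2, 3, 4],
  3 => [2, 3, 4, 5],
  4 => [3, 4, 5],
  5 => [4, 5]
}

def crawl(initial_pages)
  queue = initial_pages.dup
  seen = Set.new(queue)   # everything ever queued, so nothing runs twice
  processed = []

  while queue.any?
    page = queue.shift
    processed << page
    DISCOVERED.fetch(page, []).each do |p|
      next if seen.include?(p)   # skip pages already queued or processed
      seen << p
      queue << p
    end
  end

  processed
end

p crawl([1, 2, 3])   # => [1, 2, 3, 4, 5]
```

Because membership checks happen against the Set rather than the queue itself, pages that have already been shifted off the queue are still recognized as seen.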

You could also keep your array data structure and manually maintain a pointer to the current element. This has the advantage that the array itself remains a full list of "seen" pages, so you can subtract it from new_pages before appending anything that remains:

index = 0
queue = initial_pages
loop do
  page = queue[index]
  break if page.nil?                       # ran off the end: no more pages
  index += 1
  new_pages = get_new_pages(page) - queue  # drop anything already seen
  queue.concat(new_pages)
end
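The pointer variant can be exercised the same way. The `LINKS` hash below stands in for scraped pagination links, and `crawl_with_pointer` is just an illustrative wrapper around the loop above:

```ruby
# Simulated page discovery; real code would scrape each page here.
LINKS = {
  1 => [2, 3],
  2 => [3, 4],
  3 => [4, 5],
  4 => [],
  5 => []
}

def crawl_with_pointer(initial_pages)
  queue = initial_pages.dup
  index = 0
  loop do
    page = queue[index]
    break if page.nil?                  # pointer ran past the last element
    index += 1
    new_pages = LINKS.fetch(page, []) - queue  # the array doubles as the "seen" list
    queue.concat(new_pages)
  end
  queue
end

p crawl_with_pointer([1])   # => [1, 2, 3, 4, 5]
```

Since processed pages are never removed from the array, subtracting `queue` from the newly discovered links filters out both queued and already-processed pages in one step.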

