
I have a script that scrapes a single page of a website. However, I would like it to scrape the site incrementally over a range of page IDs; imagine the range is set to 0-999. The code is:

import requests
from bs4 import BeautifulSoup

URL = 'https://www.greekrank.com/uni/1/sororities/'
page = requests.get(URL)

soup = BeautifulSoup(page.content, 'html.parser')

uni = soup.find_all('h1', class_='overviewhead')
for title in uni:
    print(title.text)

rows = soup.find_all('div', class_='desktop-view')
for row in rows:
    print(row.text)

It would go to https://www.greekrank.com/uni/1/sororities/ and scrape that, then go to https://www.greekrank.com/uni/2/sororities/ and scrape that, and so on.

1 Comment
    Well, just write a for loop? You’ve already written two of them so you should know how to do it. Commented Apr 5, 2020 at 12:01

1 Answer


Wrap it all in a loop, and note that the URL is now built with an f-string inside the loop.

import requests
from bs4 import BeautifulSoup

for x in range(0, 1000):  # 0 through 999 inclusive; range() excludes its end value
    URL = f'https://www.greekrank.com/uni/{x}/sororities/'
    page = requests.get(URL)

    soup = BeautifulSoup(page.content, 'html.parser')

    # University name
    uni = soup.find_all('h1', class_='overviewhead')
    for title in uni:
        print(title.text)

    # Sorority listing rows
    rows = soup.find_all('div', class_='desktop-view')
    for row in rows:
        print(row.text)

4 Comments

Also, take care to add some delay after each iteration so that you are not overloading the server with requests!
When I run this in IDLE it doesn't show anything. Do you know where it might be getting stuck? If a page doesn't exist (say, ID 12), will it skip it and keep going, or hang on it forever?
I tried the following but it didn't work:

import requests
from time import sleep
from bs4 import BeautifulSoup

for x in range(0, 10):
    try:
        URL = f'greekrank.com/uni{x}/sororities/'
        page = requests.get(URL)
        soup = BeautifulSoup(page.content, 'html.parser')
        uni = soup.find_all('h1', class_='overviewhead')
        for title in uni:
            print(title.text)
        rows = soup.find_all('div', class_='desktop-view')
        for row in rows:
            print(row.text)
        time.sleep(1)
    except:
        pass
You could add a check on the status code just after the request. If it's not what you expect (probably 200), then continue.
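Putting these comments together: the attempt above most likely printed nothing because the URL is missing the https:// scheme (requests rejects scheme-less URLs) and the slash before {x}, and the bare except: pass swallowed that error; time.sleep(1) also doesn't match the from time import sleep import. A minimal sketch of a more robust loop, assuming a one-second delay and a small test range, might look like this:

import requests
from time import sleep
from bs4 import BeautifulSoup

for x in range(0, 10):  # small test range; widen once it works
    # The https:// scheme and the slash before {x} are both required
    URL = f'https://www.greekrank.com/uni/{x}/sororities/'
    page = requests.get(URL, timeout=10)

    # Skip IDs that don't exist instead of parsing an error page
    if page.status_code != 200:
        continue

    soup = BeautifulSoup(page.content, 'html.parser')

    for title in soup.find_all('h1', class_='overviewhead'):
        print(title.text)

    for row in soup.find_all('div', class_='desktop-view'):
        print(row.text)

    sleep(1)  # pause between requests so the server isn't hammered

Note that requests.get does not raise an exception for a 404, so the explicit status_code check (or page.raise_for_status()) is what actually skips missing pages rather than hanging on them.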
