
I have a script that scrapes a single page of a website. However, I would like it to scrape the site incrementally over a range of page IDs; imagine the range is set to 0-999. The code is:

import requests
from bs4 import BeautifulSoup

URL = 'https://www.greekrank.com/uni/1/sororities/'
page = requests.get(URL)

soup = BeautifulSoup(page.content, 'html.parser')

uni = soup.find_all('h1', class_='overviewhead')
for title in uni:
    print(title.text)

rows = soup.find_all('div', class_='desktop-view')
for row in rows:
    print(row.text)

It would go to https://www.greekrank.com/uni/1/sororities/ and scrape that, then go to https://www.greekrank.com/uni/2/sororities/ and scrape that, and so on.

1 Comment
    Well, just write a for loop? You’ve already written two of them so you should know how to do it. Commented Apr 5, 2020 at 12:01

1 Answer


Wrap it all in a loop, and note that the URL is now built with an f-string inside the loop.

import requests
from bs4 import BeautifulSoup

for x in range(0, 1000):  # 0 through 999 inclusive; range() excludes its end value
    URL = f'https://www.greekrank.com/uni/{x}/sororities/'
    page = requests.get(URL)

    soup = BeautifulSoup(page.content, 'html.parser')

    # University name
    uni = soup.find_all('h1', class_='overviewhead')
    for title in uni:
        print(title.text)

    # Sorority listing rows
    rows = soup.find_all('div', class_='desktop-view')
    for row in rows:
        print(row.text)

4 Comments

Also, take care to add some delay after each iteration so that you are not overloading the server with requests!
When I run this in IDLE it doesn't show anything. Do you know where it might be getting stuck? If a page doesn't exist (say, ID 12), will it skip it and keep going, or hang on it forever?
I tried the following but it didn't work:

import requests
from time import sleep
from bs4 import BeautifulSoup

for x in range(0, 10):
    try:
        URL = f'greekrank.com/uni{x}/sororities/'
        page = requests.get(URL)
        soup = BeautifulSoup(page.content, 'html.parser')
        uni = soup.find_all('h1', class_='overviewhead')
        for title in uni:
            print(title.text)
        rows = soup.find_all('div', class_='desktop-view')
        for row in rows:
            print(row.text)
        time.sleep(1)
    except:
        pass
You could add a check on the status code just after the request. If it's not what you expect (probably 200), then continue.
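Putting these comments together: the attempt above most likely printed nothing because the URL is missing the https:// scheme (requests rejects scheme-less URLs) and the slash before {x}, and the bare except: pass swallowed that error; time.sleep(1) also doesn't match the from time import sleep import. A minimal sketch of a more robust loop, assuming a one-second delay and a small test range, might look like this:

import requests
from time import sleep
from bs4 import BeautifulSoup

for x in range(0, 10):  # small test range; widen once it works
    # The https:// scheme and the slash before {x} are both required
    URL = f'https://www.greekrank.com/uni/{x}/sororities/'
    page = requests.get(URL, timeout=10)

    # Skip IDs that don't exist instead of parsing an error page
    if page.status_code != 200:
        continue

    soup = BeautifulSoup(page.content, 'html.parser')

    for title in soup.find_all('h1', class_='overviewhead'):
        print(title.text)

    for row in soup.find_all('div', class_='desktop-view'):
        print(row.text)

    sleep(1)  # pause between requests so the server isn't hammered

Note that requests.get does not raise an exception for a 404, so the explicit status_code check (or page.raise_for_status()) is what actually skips missing pages rather than hanging on them.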
