
I appreciate this has been asked many times on here, but I can't seem to get it to work for me.

I've written a scraper which successfully scrapes everything I need from the first page of the site, but I can't figure out how to get it to loop through the various pages.

The URL simply increments, like this: BLAH/3 + 'page=x'
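For example (placeholder URL, and assuming the page number is an ordinary query parameter), fetching a single page looks something like this - requests builds the query string itself:

import requests

base_url = 'http://www.URL.org/BLAH1/BLAH2/BLAH3'  # placeholder, real URL redacted

# requests appends ?page=2 (or &page=2 if the URL already has a query string)
r = requests.get(base_url, params={'page': 2})
print(r.url)          # the final URL that was actually requested
print(r.status_code)  # 200 if the page was fetched successfully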

I haven't been learning to code for very long, so any advice would be appreciated!

import requests
from bs4 import BeautifulSoup


url = 'http://www.URL.org/BLAH1/BLAH2/BLAH3'

r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")

# String substitution for HTML
for link in soup.find_all("a"):
    print("<a href='%s'>%s</a>" % (link.get("href"), link.text))

# Fetch and print general data from title class
general_data = soup.find_all('div', {'class' : 'title'})


for item in general_data:
    name = item.contents[0].text
    address = item.contents[1].text.replace('.','')
    care_type = item.contents[2].text
    print(name, address, care_type)

Update:

for page in range(10):

    r = requests.get('http://www.URL.org/BLAH1/BLAH2/BLAH3' + '&page=' + str(page))

soup = BeautifulSoup(r.content, "html.parser")
#print(soup.prettify())


# String substitution for HTML
for link in soup.find_all("a"):
    print("<a href='%s'>%s</a>" % (link.get("href"), link.text))

# Fetch and print general data from title class
general_data = soup.find_all('div', {'class' : 'title'})


for item in general_data:
    name = item.contents[0].text
    address = item.contents[1].text.replace('.','')
    care_type = item.contents[2].text
    print(name, address, care_type)

Update 2!:

import requests
from bs4 import BeautifulSoup

url = 'http://www.URL.org/BLAH1/BLAH2/BLAH3&page='

for page in range(10):

    r = requests.get(url + str(page))

soup = BeautifulSoup(r.content, "html.parser")

# String substitution for HTML
for link in soup.find_all("a"):
    print("<a href='>%s'>%s</a>" % (link.get("href"), link.text))

# Fetch and print general data from title class
general_data = soup.find_all('div', {'class' : 'title'})

for item in general_data:
    print(item.contents[0].text)
    print(item.contents[1].text.replace('.',''))
    print(item.contents[2].text)
  • Take a look at this answer: stackoverflow.com/questions/40809017/… If this doesn't help you, let us know. Commented Dec 9, 2016 at 14:51
  • to loop you need while or for - now you don't have it. Commented Dec 9, 2016 at 14:59
  • @daniboy000 - sorry, I can't seem to relate that to mine! :s Commented Dec 9, 2016 at 15:15
  • Thanks @furas. This is what I’m looking at now but cannot seem to get it to work? r = requests.get(url+page) r = requests.get('URL.org/BLAH1/BLAH2/BLAH3?page=') # url next page soup = BeautifulSoup(r.content, "html.parser") url = 'URL.org/BLAH1/BLAH2/BLAH3?page=' for page in range(10): # get 10 pages r = requests.get(url+page) Commented Dec 9, 2016 at 15:24
  • Sorry @furas, not sure how to present code in comments! Commented Dec 9, 2016 at 16:29

1 Answer


To loop through pages with page=x you need a for loop like this:

import requests
from bs4 import BeautifulSoup

url = 'http://www.housingcare.org/housing-care/results.aspx?ath=1%2c2%2c3%2c6%2c7&stp=1&sm=3&vm=list&rp=10&page='

for page in range(10):

    print('---', page, '---')

    r = requests.get(url + str(page))

    soup = BeautifulSoup(r.content, "html.parser")

    # String substitution for HTML
    for link in soup.find_all("a"):
        print("<a href='>%s'>%s</a>" % (link.get("href"), link.text))

    # Fetch and print general data from title class
    general_data = soup.find_all('div', {'class' : 'title'})

    for item in general_data:
        print(item.contents[0].text)
        print(item.contents[1].text.replace('.',''))
        print(item.contents[2].text)

Every page can be different, and a better solution needs more information about the page. Sometimes you can get a link to the last page, and then you can use that information instead of the hard-coded 10 in range(10).
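For example, if the result list contains a link to the last page, a sketch like this could replace the hard-coded 10 - note that the 'last' class name and the page= format of its href are assumptions, so check the real HTML first:

import requests
from bs4 import BeautifulSoup

url = 'http://www.housingcare.org/housing-care/results.aspx?ath=1%2c2%2c3%2c6%2c7&stp=1&sm=3&vm=list&rp=10&page='

r = requests.get(url + '0')
soup = BeautifulSoup(r.content, "html.parser")

# hypothetical selector: a link with class 'last' whose href ends in page=<number>
last_link = soup.find('a', {'class': 'last'})

if last_link:
    last_page = int(last_link.get('href').rsplit('page=', 1)[-1])
else:
    last_page = 0  # no pagination found, so assume a single page

for page in range(last_page + 1):
    r = requests.get(url + str(page))
    # ... parse every page here, the same way as above ...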

Or you can use while True to loop and break to leave the loop when there is no link to the next page. But first you would have to show this page (the URL of the real page) in the question.


EDIT: an example of how to get the link to the next page and follow it, so you get all pages - not only 10 pages as in the previous version.

import requests
from bs4 import BeautifulSoup

# link to first page - without `page=`
url = 'http://www.housingcare.org/housing-care/results.aspx?ath=1%2c2%2c3%2c6%2c7&stp=1&sm=3&vm=list&rp=10'

# only for information, not used in url
page = 0 

while True:

    print('---', page, '---')

    r = requests.get(url)

    soup = BeautifulSoup(r.content, "html.parser")

    # String substitution for HTML
    for link in soup.find_all("a"):
        print("<a href='>%s'>%s</a>" % (link.get("href"), link.text))

    # Fetch and print general data from title class
    general_data = soup.find_all('div', {'class' : 'title'})

    for item in general_data:
        print(item.contents[0].text)
        print(item.contents[1].text.replace('.',''))
        print(item.contents[2].text)

    # link to next page

    next_page = soup.find('a', {'class': 'next'})

    if next_page:
        url = next_page.get('href')
        page += 1
    else:
        break # exit `while True`
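One thing to check if this still returns the first page every time: the href of the next link may be relative (for example results.aspx?page=2), so it has to be resolved against the URL of the current page before the next request. A minimal sketch of the same loop using urllib.parse.urljoin for that:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

url = 'http://www.housingcare.org/housing-care/results.aspx?ath=1%2c2%2c3%2c6%2c7&stp=1&sm=3&vm=list&rp=10'

while True:
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "html.parser")

    # ... parse the page here, as in the loop above ...

    next_page = soup.find('a', {'class': 'next'})
    if not next_page:
        break  # no link to the next page, so stop

    # resolve a relative href against the URL of the current page
    url = urljoin(url, next_page.get('href'))

If the resolved URL never changes, the site probably paginates with JavaScript or a POST request, and requests alone will not see the following pages.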

4 Comments

better put this in question - it can be more readable and everyone will see it (and can answer)
Thanks @furas. This is what I’m looking at now but cannot seem to get it to work? r = requests.get(url+page) r = requests.get('URL.org/BLAH1/BLAH2/BLAH3?page=') # url next page soup = BeautifulSoup(r.content, "html.parser") url = 'URL.org/BLAH1/BLAH2/BLAH3?page=' for page in range(10): # get 10 pages r = requests.get(url+page)
I add example which finds link to next page and uses it instead of for-loop
I really appreciate your help! I just ran that and it is still returning the same results from the first page? :S
