
I have a web scraping program that fetches multiple pages, but I have to bound the while loop with a hard-coded number. I want a condition that stops the loop once it reaches the last page or finds no more items to scrape. Assume I don't know how many pages exist. How do I change the while loop condition so that it stops without an arbitrary number?

import requests
from bs4 import BeautifulSoup
import csv

filename="output.csv"
f=open(filename, 'w', newline="",encoding='utf-8')
headers="Date, Location, Title, Price\n"
f.write(headers)

i=0
while i<5000:
    if i==0:
        page_link="https://portland.craigslist.org/search/sss?query=xbox&sort=date"
    else:
        page_link="https://portland.craigslist.org/search/sss?s={}&query=xbox&sort=date".format(i)
    res=requests.get(page_link)
    soup=BeautifulSoup(res.text,'html.parser')
    for container in soup.select('.result-info'):
        date=container.select('.result-date')[0].text
        try:
            location=container.select('.result-hood')[0].text
        except:
            try:
                location=container.select('.nearby')[0].text 
            except:
                location='NULL'
        title=container.select('.result-title')[0].text
        try:
            price=container.select('.result-price')[0].text
        except:
            price="NULL"
        print(date,location,title,price)
        f.write(date+','+location.replace(","," ")+','+title.replace(","," ")+','+price+'\n')
    i+=120
f.close()
  • Hello, it seems you have forgotten to include the question part of your question. All that exists currently is a problem description. Please update your question so that there is something to answer. Commented Dec 7, 2017 at 23:44
  • Use while True and break to exit when you can't read more pages (try/except). Commented Dec 8, 2017 at 8:38
  • I tried this, but the break wasn't recognized: the program keeps looping forever even after there are no more items to scrape. Commented Dec 8, 2017 at 16:27
  • break is the Python statement that exits a while/for loop. If you use break inside the for loop, it exits only that for loop, not the enclosing while. You may need to run the loop with running = True ; while running: and set running = False when you want the while loop to stop. Commented Dec 8, 2017 at 19:00
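
The point in the last comment can be sketched with a small stand-alone example. The pages list and the "STOP" marker are hypothetical stand-ins for scraped pages and for whatever signals the end of the data; they are not part of the original code.

```python
# Hypothetical stand-in for scraped pages: each inner list is one page of items.
pages = [["a", "b"], ["c", "STOP", "d"], ["e"]]

collected = []
running = True
page_index = 0

while running and page_index < len(pages):
    for item in pages[page_index]:
        if item == "STOP":
            # break leaves only the for loop; the flag is what ends the while loop.
            running = False
            break
        collected.append(item)
    page_index += 1

print(collected)
```

Without the running flag, the break would only skip the rest of the current page and the while loop would move on to the next one.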

1 Answer

I use while True to run an endless loop and break to exit when a page returns no data:

    data = soup.select('.result-info')
    if not data:
        print('END: no data:')
        break
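
One way to try out this stopping rule without hitting the network is to move it into a generator that takes the page fetcher as an argument. This is only a sketch: iter_items, fake_fetch, and the 250-item data set are hypothetical names and values chosen for illustration, and the page size of 120 is taken from the offset step in the question.

```python
from typing import Callable, Iterator, List

PAGE_SIZE = 120  # offset step used in the question


def iter_items(fetch_page: Callable[[int], List[str]],
               page_size: int = PAGE_SIZE) -> Iterator[str]:
    """Yield items page by page until a page comes back empty."""
    offset = 0
    while True:
        items = fetch_page(offset)
        if not items:
            # An empty page means there is nothing left to scrape.
            break
        yield from items
        offset += page_size


def fake_fetch(offset: int) -> List[str]:
    """Stand-in for requests + BeautifulSoup: serves 250 fake items."""
    all_items = ["item{}".format(n) for n in range(250)]
    return all_items[offset:offset + PAGE_SIZE]


results = list(iter_items(fake_fetch))
print(len(results))
```

In a real run, fetch_page would download the page for the given offset and return the result of soup.select('.result-info').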

I use the csv module to save the data, so I don't need replace(","," ").
The writer wraps a field in double quotes whenever the field contains a comma.
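
A quick way to see that quoting behaviour is to write one row into an in-memory buffer. The row values here are made up for illustration.

```python
import csv
import io

# Write one row into an in-memory buffer instead of a file.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Dec 7", "Portland, OR", "Xbox One, 2 controllers", "$150"])

line = buffer.getvalue()
print(line)

# Reading the line back recovers the original four fields, commas included.
row = next(csv.reader(io.StringIO(line)))
print(row)
```

Only the fields that contain a comma get quoted, and csv.reader restores them unchanged, so no text has to be altered before saving.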

The s={} parameter can go anywhere after the ?, so I put it at the end to make the URL easier to read.
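
The order of query parameters does not change how they are parsed, which can be checked with the standard library. The build_url helper below is a hypothetical name used only for this sketch; it builds the same URL as the format() call in the full code.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

BASE = "https://portland.craigslist.org/search/sss"


def build_url(offset):
    """Build the search URL with the offset parameter s placed last."""
    return BASE + "?" + urlencode({"query": "xbox", "sort": "date", "s": offset})


url = build_url(120)
print(url)

# The question's URL puts s first; both orders parse to the same parameters.
question_url = BASE + "?s=120&query=xbox&sort=date"
same = parse_qs(urlsplit(url).query) == parse_qs(urlsplit(question_url).query)
print(same)
```

requests can also assemble the query string itself if the dictionary is passed through the params argument of requests.get.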

The site returns the first page even when you use s=0, so there is no need to special-case i == 0.
(BTW: in my code the counter has the more descriptive name offset.)

Full code.

import requests
from bs4 import BeautifulSoup
import csv

filename = "output.csv"

f = open(filename, 'w', newline="", encoding='utf-8')

csvwriter = csv.writer(f)

csvwriter.writerow( ["Date", "Location", "Title", "Price"] )

offset = 0

while True:
    print('offset:', offset)

    url = "https://portland.craigslist.org/search/sss?query=xbox&sort=date&s={}".format(offset)

    response = requests.get(url)
    if response.status_code != 200:
        print('END: request status:', response.status_code)
        break

    soup = BeautifulSoup(response.text, 'html.parser')

    data = soup.select('.result-info')
    if not data:
        print('END: no data:')
        break

    for container in data:
        date = container.select('.result-date')[0].text

        try:
            location = container.select('.result-hood')[0].text
        except IndexError:  # no .result-hood element in this result
            try:
                location = container.select('.nearby')[0].text
            except IndexError:  # no .nearby element either
                location = 'NULL'
        #location = location.replace(","," ") # don't need it with `csvwriter`

        title = container.select('.result-title')[0].text

        try:
            price = container.select('.result-price')[0].text
        except IndexError:  # listing has no price
            price = "NULL"
        #title.replace(",", " ") # don't need it with `csvwriter`

        print(date, location, title, price)

        csvwriter.writerow( [date, location, title, price] )

    offset += 120

f.close()

1 Comment

So simple and elegant!
