from bs4 import BeautifulSoup
import urllib.request, time
class scrape(object):
    def __init__(self):
        self.urls = [
            'https://www.onthemarket.com/for-sale/property/wigan/',
            'https://www.onthemarket.com/for-sale/property/wigan/?page=1',
            'https://www.onthemarket.com/for-sale/property/wigan/?page=2',
            'https://www.onthemarket.com/for-sale/property/wigan/?page=3',
            'https://www.onthemarket.com/for-sale/property/wigan/?page=4',
            'https://www.onthemarket.com/for-sale/property/wigan/?page=6',
        ]
        self.telephones = []
    def extract_info(self):
        for link in self.urls:
            data = urllib.request.urlopen(link).read()
            soup = BeautifulSoup(data, "lxml")
            for tel in soup.findAll("span", {"class":"call"}):
                self.telephones.append(tel.text.strip())
            time.sleep(1)
        return self.telephones

to = scrape()
print(to.extract_info())

What is wrong? This code hangs after the second website. It should extract the phone numbers from each webpage in the list self.urls.

  • If you are getting any error, please post it as well. Commented Dec 4, 2017 at 9:36
  • I've tried your code, everything works fine. [Finished in 9.3s] Commented Dec 4, 2017 at 9:41
  • There is no error. The Python shell is doing work but not returning anything. I use Spyder with Python 3.6. I have waited more than 5 minutes and nothing happens. Commented Dec 4, 2017 at 9:45
  • Are you sure that it is not a network problem? Is the URL being processed accessible at the moment it hangs? Commented Dec 4, 2017 at 9:49
  • ventik, yes, a network problem is possible, but in my case the first two sites are scraped correctly and after that it hangs for no apparent reason. ventik, what Python IDE did you use? Commented Dec 4, 2017 at 9:57
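One way to narrow a hang like this down (a diagnostic sketch, not from the thread; the 10-second timeout is an arbitrary choice) is to pass a timeout to urlopen, so a stalled request raises an exception instead of blocking forever, and to attach a browser-like User-Agent via urllib.request.Request:

```python
import urllib.request

url = "https://www.onthemarket.com/for-sale/property/wigan/"

# Attach a browser-like User-Agent; urllib's default ("Python-urllib/3.x")
# is easy for a server to single out and stall.
req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})

# A timeout turns a silent hang into an exception after 10 seconds
# (uncomment to actually fetch):
# data = urllib.request.urlopen(req, timeout=10).read()

print(req.get_header("User-agent"))  # → Mozilla/5.0
```

If the fetch still times out with the header attached, the problem is more likely the network than the client.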

1 Answer


All you need to do is add headers to your request and give it a go. Try this:

from bs4 import BeautifulSoup
import requests, time

class scrape(object):

    def __init__(self):
        self.urls = [
            'https://www.onthemarket.com/for-sale/property/wigan/',
            'https://www.onthemarket.com/for-sale/property/wigan/?page=1',
            'https://www.onthemarket.com/for-sale/property/wigan/?page=2',
            'https://www.onthemarket.com/for-sale/property/wigan/?page=3',
            'https://www.onthemarket.com/for-sale/property/wigan/?page=4',
            'https://www.onthemarket.com/for-sale/property/wigan/?page=6',
        ]
        self.telephones = []

    def extract_info(self):
        for link in self.urls:
            data = requests.get(link, headers={"User-Agent": "Mozilla/5.0"})  # it should do the trick
            soup = BeautifulSoup(data.text, "lxml")
            for tel in soup.find_all("span",{"class":"call"}):
                self.telephones.append(tel.text.strip())
            time.sleep(1)
        return self.telephones

crawl = scrape()
print(crawl.extract_info())
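The parsing side can also be checked in isolation, independent of the network issue. The sample HTML below is made up to mirror the `span.call` elements the scraper targets, and the stdlib "html.parser" backend is used so lxml is not required:

```python
from bs4 import BeautifulSoup

# Hypothetical markup mirroring the listing pages' phone-number spans.
sample_html = """
<div class="listing">
  <span class="call">01942 000001</span>
  <span class="call"> 01942 000002 </span>
  <span class="price">£150,000</span>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")
telephones = [tel.text.strip() for tel in soup.find_all("span", {"class": "call"})]
print(telephones)  # → ['01942 000001', '01942 000002']
```

If this works but the live pages return an empty list, the fetch (headers, blocking, or an error page) is the culprit rather than the parsing.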

2 Comments

Btw, in your case two sites worked and the rest did not, but in my case I was getting a blank list. However, after putting headers in the request, I got it working flawlessly @FootAdministration.
Thank you Shahin it worked for me! Great answer! Have a nice day!
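For a loop over several pages like the one above, a requests.Session lets the header be set once and reused; adding a timeout means a stalled connection raises an exception instead of hanging silently. Both the Session and the 10-second timeout are extensions beyond the accepted answer, not part of it:

```python
import requests

session = requests.Session()
# Set the browser-like User-Agent once for every request on this session.
session.headers.update({"User-Agent": "Mozilla/5.0"})

def fetch(url):
    """Fetch a page, raising on HTTP errors or a 10-second stall."""
    resp = session.get(url, timeout=10)
    resp.raise_for_status()
    return resp.text
```

A Session also reuses the underlying TCP connection across the paginated URLs, which is slightly faster and gentler on the server than opening a fresh connection per page.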
