from bs4 import BeautifulSoup
import urllib.request, time
class scrape(object):
    def __init__(self):
        self.urls = [
            'https://www.onthemarket.com/for-sale/property/wigan/',
            'https://www.onthemarket.com/for-sale/property/wigan/?page=1',
            'https://www.onthemarket.com/for-sale/property/wigan/?page=2',
            'https://www.onthemarket.com/for-sale/property/wigan/?page=3',
            'https://www.onthemarket.com/for-sale/property/wigan/?page=4',
            'https://www.onthemarket.com/for-sale/property/wigan/?page=6',
        ]
        self.telephones = []
    def extract_info(self):
        for link in self.urls:
            data = urllib.request.urlopen(link).read()
            soup = BeautifulSoup(data, "lxml")
            for tel in soup.findAll("span", {"class":"call"}):
                self.telephones.append(tel.text.strip())
            time.sleep(1)
        return self.telephones

to = scrape()
print(to.extract_info())

What is wrong? This code hangs after the second website. It should extract the phone numbers from each webpage in the list self.urls.

  • If you are getting any error, please post it as well. Commented Dec 4, 2017 at 9:36
  • I've tried your code, everything works fine. [Finished in 9.3s] Commented Dec 4, 2017 at 9:41
  • There is no error. The Python shell is doing work but not returning anything. I use Spyder with Python 3.6. I have waited more than 5 minutes and nothing happens. Commented Dec 4, 2017 at 9:45
  • Are you sure that it is not a network problem? Is the URL being processed accessible at the moment it hangs? Commented Dec 4, 2017 at 9:49
  • ventik, yes, a network problem is possible, but in my case the first two sites are scraped correctly and after that it hangs for no apparent reason. ventik, what Python IDE did you use? Commented Dec 4, 2017 at 9:57
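One way to narrow a hang like this down (a diagnostic sketch, not from the thread; the 10-second timeout is an arbitrary choice) is to pass a timeout to urlopen, so a stalled request raises an exception instead of blocking forever, and to attach a browser-like User-Agent via urllib.request.Request:

```python
import urllib.request

url = "https://www.onthemarket.com/for-sale/property/wigan/"

# Attach a browser-like User-Agent; urllib's default ("Python-urllib/3.x")
# is easy for a server to single out and stall.
req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})

# A timeout turns a silent hang into an exception after 10 seconds
# (uncomment to actually fetch):
# data = urllib.request.urlopen(req, timeout=10).read()

print(req.get_header("User-agent"))  # → Mozilla/5.0
```

If the fetch still times out with the header attached, the problem is more likely the network than the client.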

1 Answer


All you need to do is add headers to your request and give it a go. Try this:

from bs4 import BeautifulSoup
import requests, time

class scrape(object):

    def __init__(self):
        self.urls = [
            'https://www.onthemarket.com/for-sale/property/wigan/',
            'https://www.onthemarket.com/for-sale/property/wigan/?page=1',
            'https://www.onthemarket.com/for-sale/property/wigan/?page=2',
            'https://www.onthemarket.com/for-sale/property/wigan/?page=3',
            'https://www.onthemarket.com/for-sale/property/wigan/?page=4',
            'https://www.onthemarket.com/for-sale/property/wigan/?page=6',
        ]
        self.telephones = []

    def extract_info(self):
        for link in self.urls:
            data = requests.get(link, headers={"User-Agent": "Mozilla/5.0"})  # it should do the trick
            soup = BeautifulSoup(data.text, "lxml")
            for tel in soup.find_all("span",{"class":"call"}):
                self.telephones.append(tel.text.strip())
            time.sleep(1)
        return self.telephones

crawl = scrape()
print(crawl.extract_info())
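The parsing side can also be checked in isolation, independent of the network issue. The sample HTML below is made up to mirror the `span.call` elements the scraper targets, and the stdlib "html.parser" backend is used so lxml is not required:

```python
from bs4 import BeautifulSoup

# Hypothetical markup mirroring the listing pages' phone-number spans.
sample_html = """
<div class="listing">
  <span class="call">01942 000001</span>
  <span class="call"> 01942 000002 </span>
  <span class="price">£150,000</span>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")
telephones = [tel.text.strip() for tel in soup.find_all("span", {"class": "call"})]
print(telephones)  # → ['01942 000001', '01942 000002']
```

If this works but the live pages return an empty list, the fetch (headers, blocking, or an error page) is the culprit rather than the parsing.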

2 Comments

Btw, in your case two sites worked and the rest did not, but in my case I was getting a blank list. However, after putting headers in the request, I got it working flawlessly @FootAdministration.
Thank you Shahin it worked for me! Great answer! Have a nice day!
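For a loop over several pages like the one above, a requests.Session lets the header be set once and reused; adding a timeout means a stalled connection raises an exception instead of hanging silently. Both the Session and the 10-second timeout are extensions beyond the accepted answer, not part of it:

```python
import requests

session = requests.Session()
# Set the browser-like User-Agent once for every request on this session.
session.headers.update({"User-Agent": "Mozilla/5.0"})

def fetch(url):
    """Fetch a page, raising on HTTP errors or a 10-second stall."""
    resp = session.get(url, timeout=10)
    resp.raise_for_status()
    return resp.text
```

A Session also reuses the underlying TCP connection across the paginated URLs, which is slightly faster and gentler on the server than opening a fresh connection per page.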
