
I am trying to scrape school names from the following url: https://www.niche.com/k12/search/best-public-high-schools/s/indiana/?page=1.

I want to scrape 10 pages, hence the for loop. I have never used BeautifulSoup before and the documentation hasn't solved my problem. Ultimately, I want to scrape the h2 elements, since that's where the school names reside. Below is the small amount of code I have. Any help would be greatly appreciated. Thanks!

import bs4 as bs
import requests

numbers = ['1', '2', '3', '4', '5', '6', '7', '8', '9', '10']

names = []
for number in numbers:
    resp = requests.get('https://www.niche.com/k12/search/best-public-high-schools/s/indiana/?page='+number)
    soup = bs.BeautifulSoup(resp.text, "lxml")
    school_names = soup.find('div', {'class': 'search-results'})
    for school_name in school_names:
        school = school_name.find('h2')
        if school:
            print (school.text)
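As an aside, the list of page-number strings above can be built with range instead; a minimal sketch of just the URL construction (no request is made here; the base URL is copied from the question):

```python
# Build the ten page URLs without maintaining a list of number strings.
base = 'https://www.niche.com/k12/search/best-public-high-schools/s/indiana/'

urls = [f'{base}?page={page}' for page in range(1, 11)]

print(urls[0])    # URL for page 1
print(urls[-1])   # URL for page 10
```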
  • What's your problem / error? Commented Feb 19, 2020 at 17:48
  • The problem I have seen is a 403 Forbidden; is it caused by the User-Agent? Commented Feb 19, 2020 at 17:53
  • I added print(resp.text) right after the request and got <head><title>403 Forbidden</title></head>, so that's your first problem. You will need to read up on authenticating with requests. I shouldn't need to mention this, but don't post your username/password here if you need more help! Commented Feb 19, 2020 at 17:54
  • @CCebrian had a great point. I ran resp = requests.get('https://www.niche.com/k12/search/best-public-high-schools/s/indiana/?page='+number, headers={"user-agent":"Mozilla/5.0"}) and got your web page. On to the next issue with the code... Commented Feb 19, 2020 at 17:59
  • This time I got Access to this page has been denied because we believe you are using automation tools to browse the website. Ouch! Because it's true! You'll need to research how to defeat that. In the meantime, you could bring up the page in your browser, save it, and practice your web scraping on the file. Commented Feb 19, 2020 at 18:04

2 Answers


Try this, passing the headers. Using https://curl.trillworks.com/ as a helper, I get:

import requests

headers = {
    'authority': 'fonts.gstatic.com',
    'pragma': 'no-cache',
    'cache-control': 'no-cache',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.87 Safari/537.36',
    'sec-fetch-dest': 'font',
    'accept': '*/*',
    'sec-fetch-site': 'cross-site',
    'sec-fetch-mode': 'cors',
    'sec-fetch-user': '?1',
    'accept-language': 'en-US,en;q=0.9',
    'cookie': '_pxhd=120bcbd3ded2e33c1496a0ff505f52a169b1f9c1db59a881c1cd00495b9442ee:62dfdf81-5341-11ea-95d7-e144631f0943; xid=6fef7398-e61d-46d2-be72-ee8e8fecc13d; navigation=%7B%22location%22%3A%7s%22%3A%7B%22colleges%22%3A%22%2Fs%2Findiana%2F%22%2C%22graduate-schools%22%3A%22%2Fs%2Findiana%2F%22%2C%22k12%22%3A%22%2Fs%2Findiana%2F%22%2C%22places-to-live%22%3A%22%2Fs%2Findiana%2F%22%2C%22places-to-work%22%3A%22%2Fs%2Findiana%2F%22%7D%7D; experiments=%5E%5E%5E%24%5D; recentlyViewed=entityHistory%7CsearchHistory%7CentityName%7CIndiana%7CentityGuid%7Cad8b4b4c-f8d2-4015-8b22-c0f002a720bb%7CentityType%7CState%7CentityFragment%7Cindiana%5E%5E%5E%240%7C%40%5D%7C1%7C%40%242%7C3%7C4%7C5%7C6%7C7%7C8%7C9%5D%5D%5D; hintSeenLately=second_hint',
    'Connection': 'keep-alive',
    'Pragma': 'no-cache',
    'Cache-Control': 'no-cache',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.87 Safari/537.36',
    'Sec-Fetch-Dest': 'image',
    'Accept': 'image/webp,image/apng,image/*,*/*;q=0.8',
    'Sec-Fetch-Site': 'cross-site',
    'Sec-Fetch-Mode': 'no-cors',
    'Referer': 'https://www.niche.com/k12/search/best-public-high-schools/s/indiana/?page=1',
    'Accept-Language': 'en-US,en;q=0.9',
    'x-client-data': 'CI+2yQEIorbJAQjBtskBCKmdygEIy67KAQi8sMoBCJa1ygEIm7XKAQjstcoBCI66ygEIsL3KARirpMoB',
    'referer': 'https://fonts.googleapis.com/css?family=Source+Sans+Pro:300,400,600,700',
    'origin': 'https://www.niche.com',
    'Origin': 'https://www.niche.com',
}

params = (
    ('page', '1'),
)

response = requests.get('https://www.niche.com/k12/search/best-public-high-schools/s/indiana/', headers=headers, params=params)

#NB. Original query string below. It seems impossible to parse and
#reproduce query strings 100% accurately so the one below is given
#in case the reproduced version is not "correct".
# response = requests.get('https://www.niche.com/k12/search/best-public-high-schools/s/indiana/?page=1', headers=headers)

This now gives me a 200 instead of a 403. The headers above are verbose, of course (I copied them from my browser); you could probably use trial and error to see which headers are actually required (I'm guessing it's only a handful) to guarantee a 200 OK.
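That trial-and-error step can itself be automated: drop one header at a time and keep the removal whenever the request still succeeds. A minimal sketch (minimal_headers and is_ok are hypothetical names, not part of requests; in practice is_ok would wrap requests.get and check for a 200):

```python
def minimal_headers(headers, is_ok):
    """Greedily drop headers that is_ok does not need to succeed.

    is_ok is a callable taking a headers dict and returning True when the
    request succeeds (e.g. status code 200). Assumes the full dict passes.
    """
    kept = dict(headers)
    for name in list(kept):
        trial = {k: v for k, v in kept.items() if k != name}
        if is_ok(trial):     # still works without this header,
            kept = trial     # so leave it out permanently
    return kept
```

With requests, is_ok could be something like lambda h: requests.get(url, headers=h).status_code == 200. Note the greedy result can depend on iteration order and on the server's behavior, so treat it as a starting point rather than a definitive minimal set.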




The webpage you are trying to scrape is protected by a CAPTCHA, which makes it difficult to collect data. Take a look at this link:

https://sqa.stackexchange.com/questions/17022/how-to-fill-captcha-using-test-automation

