
I am trying to scrape school names from the following url: https://www.niche.com/k12/search/best-public-high-schools/s/indiana/?page=1.

I want to scrape 10 pages, hence the for loop. I have never used BeautifulSoup before and the documentation hasn't solved my problem. Ultimately, I want to scrape the h2 elements, since that's where the school names reside. Below is the small amount of code I have. Any help would be greatly appreciated. Thanks!

import bs4 as bs
import requests

numbers = ['1', '2', '3', '4', '5', '6', '7', '8', '9', '10']

names = []
for number in numbers:
    resp = requests.get('https://www.niche.com/k12/search/best-public-high-schools/s/indiana/?page='+number)
    soup = bs.BeautifulSoup(resp.text, "lxml")
    school_names = soup.find('div', {'class': 'search-results'})
    for school_name in school_names:
        school = school_name.find('h2')
        if school:
            print (school.text)
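As an aside, the list of page-number strings above can be built with range instead; a minimal sketch of just the URL construction (no request is made here; the base URL is copied from the question):

```python
# Build the ten page URLs without maintaining a list of number strings.
base = 'https://www.niche.com/k12/search/best-public-high-schools/s/indiana/'

urls = [f'{base}?page={page}' for page in range(1, 11)]

print(urls[0])    # URL for page 1
print(urls[-1])   # URL for page 10
```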
  • What's your problem / error? Commented Feb 19, 2020 at 17:48
  • The problem I have seen is a 403 Forbidden; is it caused by the User-Agent? Commented Feb 19, 2020 at 17:53
  • I added print(resp.text) right after the request and got <head><title>403 Forbidden</title></head>, so that's your first problem. You will need to read up on authenticating with requests. I shouldn't need to mention this, but don't post your username/password here if you need more help! Commented Feb 19, 2020 at 17:54
  • @CCebrian had a great point. I ran resp = requests.get('https://www.niche.com/k12/search/best-public-high-schools/s/indiana/?page='+number, headers={"user-agent":"Mozilla/5.0"}) and got your web page. On to the next issue with the code... Commented Feb 19, 2020 at 17:59
  • This time I got Access to this page has been denied because we believe you are using automation tools to browse the website. Ouch! Because it's true! You'll need to research how to defeat that. In the meantime, you could bring up the page in your browser, save it, and practice your web scraping on the file. Commented Feb 19, 2020 at 18:04

2 Answers


Try this, passing the headers. Using https://curl.trillworks.com/ as a helper, I get:

import requests

headers = {
    'authority': 'fonts.gstatic.com',
    'pragma': 'no-cache',
    'cache-control': 'no-cache',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.87 Safari/537.36',
    'sec-fetch-dest': 'font',
    'accept': '*/*',
    'sec-fetch-site': 'cross-site',
    'sec-fetch-mode': 'cors',
    'sec-fetch-user': '?1',
    'accept-language': 'en-US,en;q=0.9',
    'cookie': '_pxhd=120bcbd3ded2e33c1496a0ff505f52a169b1f9c1db59a881c1cd00495b9442ee:62dfdf81-5341-11ea-95d7-e144631f0943; xid=6fef7398-e61d-46d2-be72-ee8e8fecc13d; navigation=%7B%22location%22%3A%7s%22%3A%7B%22colleges%22%3A%22%2Fs%2Findiana%2F%22%2C%22graduate-schools%22%3A%22%2Fs%2Findiana%2F%22%2C%22k12%22%3A%22%2Fs%2Findiana%2F%22%2C%22places-to-live%22%3A%22%2Fs%2Findiana%2F%22%2C%22places-to-work%22%3A%22%2Fs%2Findiana%2F%22%7D%7D; experiments=%5E%5E%5E%24%5D; recentlyViewed=entityHistory%7CsearchHistory%7CentityName%7CIndiana%7CentityGuid%7Cad8b4b4c-f8d2-4015-8b22-c0f002a720bb%7CentityType%7CState%7CentityFragment%7Cindiana%5E%5E%5E%240%7C%40%5D%7C1%7C%40%242%7C3%7C4%7C5%7C6%7C7%7C8%7C9%5D%5D%5D; hintSeenLately=second_hint',
    'Connection': 'keep-alive',
    'Pragma': 'no-cache',
    'Cache-Control': 'no-cache',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.87 Safari/537.36',
    'Sec-Fetch-Dest': 'image',
    'Accept': 'image/webp,image/apng,image/*,*/*;q=0.8',
    'Sec-Fetch-Site': 'cross-site',
    'Sec-Fetch-Mode': 'no-cors',
    'Referer': 'https://www.niche.com/k12/search/best-public-high-schools/s/indiana/?page=1',
    'Accept-Language': 'en-US,en;q=0.9',
    'x-client-data': 'CI+2yQEIorbJAQjBtskBCKmdygEIy67KAQi8sMoBCJa1ygEIm7XKAQjstcoBCI66ygEIsL3KARirpMoB',
    'referer': 'https://fonts.googleapis.com/css?family=Source+Sans+Pro:300,400,600,700',
    'origin': 'https://www.niche.com',
    'Origin': 'https://www.niche.com',
}

params = (
    ('page', '1'),
)

response = requests.get('https://www.niche.com/k12/search/best-public-high-schools/s/indiana/', headers=headers, params=params)

#NB. Original query string below. It seems impossible to parse and
#reproduce query strings 100% accurately so the one below is given
#in case the reproduced version is not "correct".
# response = requests.get('https://www.niche.com/k12/search/best-public-high-schools/s/indiana/?page=1', headers=headers)

This now gives me a 200 instead of a 403. The headers above are verbose, of course (I copied them from my browser); you could probably use trial and error to see which headers are actually required (I'm guessing it's only a handful) to guarantee a 200 OK.
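That trial-and-error step can itself be automated: drop one header at a time and keep the removal whenever the request still succeeds. A minimal sketch (minimal_headers and is_ok are hypothetical names, not part of requests; in practice is_ok would wrap requests.get and check for a 200):

```python
def minimal_headers(headers, is_ok):
    """Greedily drop headers that is_ok does not need to succeed.

    is_ok is a callable taking a headers dict and returning True when the
    request succeeds (e.g. status code 200). Assumes the full dict passes.
    """
    kept = dict(headers)
    for name in list(kept):
        trial = {k: v for k, v in kept.items() if k != name}
        if is_ok(trial):     # still works without this header,
            kept = trial     # so leave it out permanently
    return kept
```

With requests, is_ok could be something like lambda h: requests.get(url, headers=h).status_code == 200. Note the greedy result can depend on iteration order and on the server's behavior, so treat it as a starting point rather than a definitive minimal set.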




The webpage you are trying to scrape is protected by a CAPTCHA, which makes it difficult to collect data. Take a look at this link:

https://sqa.stackexchange.com/questions/17022/how-to-fill-captcha-using-test-automation

