Python - Web scraping using HTML tags

Question

I am trying to scrape a web page to list the jobs posted at this URL: https://careers.microsoft.com/us/en/search-results?rk=l-hyderabad

Refer to the image for details of the web-page inspection (image: Web inspect).

I observed the following when inspecting the page:

Each listed job is in an HTML li with class="jobs-list-item". The li contains the following attributes and data in a parent div:

data-ph-at-job-title-text="Software Engineer II", data-ph-at-job-category-text="Engineering", data-ph-at-job-post-date-text="2018-03-19T16:33:00".
The 1st child div within the parent div with class="information" contains a link with href="https://careers.microsoft.com/us/en/job/406138/Software-Engineer-II"
The 3rd child div with class="description au-target" within the parent div contains the short job description

I need to extract the following information for each job:

Job Title
Job Category
Job Post Date
Job Post Time
Job URL
Job Short Description

I have tried the following Python code to scrape the web page, but I am unable to extract the required information.

import requests
from bs4 import BeautifulSoup


def ms_jobs():
    url = 'https://careers.microsoft.com/us/en/search-results?rk=l-hyderabad'
    resp = requests.get(url)

    if resp.status_code == 200:
        print("Successfully opened the web page")
        # Parse the raw HTML returned by the server
        soup = BeautifulSoup(resp.text, 'html.parser')
        print(soup)
    else:
        print("Error")


ms_jobs()

You need to use a browser simulator like Selenium to extract the required data from that page, because the data is generated dynamically.
– SIM, Commented Jun 24, 2018 at 20:18
Thanks for the suggestion, SIM. I do not have any knowledge of Selenium with Python. Can you please point me to a working sample solution that I can tweak?
– Pal S, Commented Jun 25, 2018 at 19:23

jlaur · Accepted Answer · 2018-06-24 21:42:24Z


If you want to do this via requests, you need to reverse engineer the site. Open the dev tools in Chrome, select the Network tab and fill out the form.

This will show you how the site loads the data. If you dig in, you'll see that it grabs the data by doing a POST to this endpoint: https://careers.microsoft.com/widgets. The dev tools also show you the payload that the site sends. The site uses cookies, so all you have to do is create a session that keeps the cookie, request a page to obtain one, and copy/paste the payload.

This way you'll be able to extract the same JSON data that the JavaScript fetches to populate the site dynamically.

Below is a working example of what that would look like. All that is left is to parse the JSON as you see fit.

import requests
from pprint import pprint

# create a session to grab a cookie from the site
session = requests.Session()
r = session.get("https://careers.microsoft.com/us/en/")

# these params are the ones that the dev tools show the site sending when you use the website form
payload = {
    "lang":"en_us",
    "deviceType":"desktop",
    "country":"us",
    "ddoKey":"refineSearch",
    "sortBy":"",
    "subsearch":"",
    "from":0,
    "jobs":"true",
    "counts":"true",
    "all_fields":["country","state","city","category","employmentType","requisitionRoleType","educationLevel"],
    "pageName":"search-results",
    "size":20,
    "keywords":"",
    "global":"true",
    "selected_fields":{"city":["Hyderabad"],"country":["India"]},
    "sort":"null",
    "locationData":{}
}

# this is the endpoint the site uses to fetch json
url = "https://careers.microsoft.com/widgets"
r = session.post(url, json=payload)
data = r.json()
job_list = data['refineSearch']['data']['jobs']

# job_list will hold 20 jobs (you can set the "size" parameter in the payload to a higher number if you please; I tested 100, which returned 100 jobs)
job = job_list[0]
pprint(job)
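
As a rough sketch of the remaining parsing step, the helpers below flatten one job dict into the six fields the question asks for and walk through result pages. The key names in ASSUMED_KEYS are hypothetical placeholders, not confirmed field names; swap in whatever pprint(job) actually prints. The sketch also assumes that the timestamp is plain ISO 8601 like the "2018-03-19T16:33:00" value quoted in the question, that job URLs follow the pattern of the link quoted there, and that the endpoint treats "from" as a result offset.

```python
from datetime import datetime

# Hypothetical key names: replace them with the keys that pprint(job) shows.
ASSUMED_KEYS = {
    "title": "title",
    "category": "category",
    "posted": "postedDate",
    "job_id": "jobId",
    "description": "descriptionTeaser",
}

# Assumed URL pattern, modelled on the job link quoted in the question.
JOB_URL = "https://careers.microsoft.com/us/en/job/{job_id}/{slug}"


def job_to_row(job, keys=ASSUMED_KEYS):
    """Flatten one job dict into the six fields asked for in the question."""
    title = job.get(keys["title"], "")
    posted = job.get(keys["posted"])
    # Assumes a plain ISO 8601 timestamp such as "2018-03-19T16:33:00".
    stamp = datetime.fromisoformat(posted) if posted else None
    return {
        "title": title,
        "category": job.get(keys["category"], ""),
        "post_date": stamp.date().isoformat() if stamp else "",
        "post_time": stamp.time().isoformat() if stamp else "",
        "url": JOB_URL.format(
            job_id=job.get(keys["job_id"], ""),
            slug=title.replace(" ", "-"),
        ),
        "description": job.get(keys["description"], ""),
    }


def iter_jobs(fetch_page, size=20):
    """Yield jobs page by page until a page comes back short or empty.

    fetch_page(start, size) is a caller-supplied function that returns the
    list of jobs for one request.
    """
    start = 0
    while True:
        page = fetch_page(start, size)
        yield from page
        if len(page) < size:
            break
        start += size
```

To use this with the code above, fetch_page could copy the payload, override "from" and "size", post it with the session, and return data['refineSearch']['data']['jobs']. Then [job_to_row(job) for job in iter_jobs(fetch_page)] would give one dict per job.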

Cheers.

answered Jun 24, 2018 at 21:42


3 Comments

Pal S Over a year ago

Thanks jlaur. Your solution works great. I am curious about your comment about wanting to use "Requests". I would like to use a simpler option if one is available. Reverse engineering the site seems complex given my limited knowledge of the Chrome developer tools. Is there a simpler solution? I see @SIM has suggested using Selenium.

jlaur Over a year ago

You're welcome. You asked how to do this with requests. This is how you do that. I would do a job like this using requests. If you want to explore how to do this with e.g. Selenium, close this question and ask a new one on that subject. Bear in mind, though, that such a solution would be a lot slower, so if you're into serious scraping that's imho not the way to go...

Pal S Over a year ago

Good to know. "Requests" is the way forward. Thanks for the tip. Will keep it in mind. Also accepted your answer.
