The Python "Requests" module cannot detect certain HTML link tags

Question

I'm sure this is an easy question for someone who has experience with web programming and basic web scraping (which I do not have).

My goal is to obtain information about the many tutors that Chegg hires by scraping their "bio" paragraphs. Although I am a novice at web scraping, I imagine that this will involve coding a scraper that recursively clicks through the tutors' links:

List of Tutors

And scrapes the tutors' bios

Using the Microsoft Edge DOM Explorer, I can detect the tutor's link tag in the page's HTML:

Tutor's HTML link tag

However, when I use Python's "Requests" module to obtain the HTML of the web page, the tutor's link is not there! Strangely, other links on the web page are detected, but none of the tutors' links. The Python code looks like this:

import requests

r = requests.get('www.chegg.com/tutors/online-tutors/')

print r.content

Can someone advise me on this problem, and what I should go about learning (e.g. HTML programming, HTTP Theory, etc) so I will be equipped to handle this project?

r = requests.get('www.chegg.com/tutors/online-tutors/') won't work at all because the URL is missing the http:// prefix. Browsers execute JavaScript code when referenced from a page or included in it. That code could load additional information into a page.
– Martijn Pieters, Commented Jul 2, 2016 at 16:19
Looking at the network tab when loading that page, I see a series of API calls to https://www.chegg.com/tutors/api/v1/subject/?fields=name,id&searchable=true and other links returning JSON data. The info you are looking for is almost certainly contained in those responses.
– Martijn Pieters, Commented Jul 2, 2016 at 16:21
To your first reply, I didn't put 'http://' because Stack Exchange is only allowing me to post 2 links in my question, and I didn't want that to be counted as one of them. Regarding your second reply, if I find a way to decode this JSON data, will I have access to the info that I need?
– Millinneo, Commented Jul 2, 2016 at 16:44
@Millinneo, what you want is in the source, so why would you need to mimic JSON requests?
– Padraic Cunningham, Commented Jul 2, 2016 at 16:46

Padraic Cunningham · Accepted Answer · 2016-07-02 16:42:08Z

All the data for each expert is inside the div with the expert-list-content class:

from bs4 import BeautifulSoup
import requests

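# Name the parser explicitly so results do not depend on which parsers are installed.
soup = BeautifulSoup(requests.get("https://www.chegg.com/tutors/online-tutors/").content, "html.parser")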
for ex in soup.select("div.expert-list-content"):
    print(ex.select_one("div.expert-description").text)

That gives you:

"Tutoring gives me great pleasure because I not only get to feel good about helping others, but my students also gain..."
"I was a teaching assistant as a graduate student in mathematics, and taught several classes as a postdoc. I have been a tutor..."
"I have always been the go-to student for notes, essay proofreading, and math instruction. I have tutored at the Latino..."
"In my senior year of high school, I worked as a Physics Teaching Assistant and through that, I honed skills necessary to..."
"Throughout the past eight years, I have had the incredible opportunity to work closely with over 200 students in..."
"I have worked as a teaching assistant in my college for core disciplinary courses. I have also conducted training sessions on..."
"Scott here. Originally from Tennessee and educated in Cornell University, I've been tutoring/teaching math for 10 years and..."
"I am currently pursuing dual BE Mechanical Engineering and M.Sc Mathematics degrees from BITS Pilani. I have had ample..."
"I am a specialist in language and linguistics, with a particular interest in the history and grammar of the English language..."
"I graduated 7 years before and since then have taught many students on a regular basis in Finance and Mathematics. I have..."

To get the profile links and name:

for ex in soup.select("div.expert-list-content"):
    info = ex.select_one("div.expert-info a")
    print(info.text, info["href"])

Which gives you:

(u'Aleria S.', '/tutors/online-tutors/Aleria-S-371573/')
(u'Douglas Z.', '/tutors/online-tutors/Douglas-Z-568826/')
(u'Carla S.', '/tutors/online-tutors/Carla-S-864918/')
(u'Vinit R.', '/tutors/online-tutors/Vinit-R-2031766/')
(u'Anastasia G.', '/tutors/online-tutors/Anastasia-G-65278/')
(u'Vinay S.', '/tutors/online-tutors/Vinay-S-85533/')
(u'Gunjan G.', '/tutors/online-tutors/Gunjan-G-2695711/')
(u'Scott M.', '/tutors/online-tutors/Scott-M-277743/')
(u'Saumya U.', '/tutors/online-tutors/Saumya-U-890305/')
(u'Ed M.', '/tutors/online-tutors/Ed-M-2895636/')
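
The hrefs above are relative to the site root, so they need to be turned into absolute URLs before each profile page can be requested. A minimal sketch of that step, assuming Python 3 and the standard library's urljoin; the helper name absolute_profile_urls and the constant BASE_URL are made up for illustration, and the two hrefs are copied from the output above:

```python
from urllib.parse import urljoin

# Listing page the relative hrefs were scraped from.
BASE_URL = "https://www.chegg.com/tutors/online-tutors/"


def absolute_profile_urls(hrefs, base=BASE_URL):
    """Resolve each relative href against the listing page URL."""
    return [urljoin(base, href) for href in hrefs]


# Two of the hrefs printed above, used here as sample input.
hrefs = [
    "/tutors/online-tutors/Aleria-S-371573/",
    "/tutors/online-tutors/Douglas-Z-568826/",
]
profile_urls = absolute_profile_urls(hrefs)
for url in profile_urls:
    print(url)
```

Each resulting URL could then be passed to requests.get in the same way as the listing page.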

There is no JavaScript involved: if you right-click in your browser and choose "View source", you can see that it is all there. If the links were created dynamically, you would see them in the Microsoft Edge DOM Explorer but not in the page source. In general, it is good practice to send a User-Agent header:

head = {"User-Agent":"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36"}
soup = BeautifulSoup(requests.get("https://www.chegg.com/tutors/online-tutors/", headers=head).content, "html.parser")
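
The "view source" check can also be done from a script: if an element's class appears in the raw HTML the server returns, no JavaScript is needed to reach it. A rough sketch using only the Python 3 standard library; ClassFinder, count_class, and sample_html are hypothetical names and data made up for illustration, and in practice the string passed in would be the text of the response from requests.get:

```python
from html.parser import HTMLParser


class ClassFinder(HTMLParser):
    """Counts start tags whose class attribute contains a given class name."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        classes = (dict(attrs).get("class") or "").split()
        if self.class_name in classes:
            self.count += 1


def count_class(html, class_name):
    """Return how many tags in the raw HTML carry class_name."""
    finder = ClassFinder(class_name)
    finder.feed(html)
    return finder.count


# Made-up stand-in for a server response.
sample_html = (
    '<div class="expert-list-content">'
    '<div class="expert-description">Bio text</div>'
    "</div>"
)
print(count_class(sample_html, "expert-list-content"))
```

A count of zero for a class that is visible in the DOM Explorer would suggest the element is added after the page loads.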
