For educational purposes, I am trying to scrape this page using lxml and requests in Python.

Specifically, I just want to print the research areas of all the professors on the page. This is what I have done so far:

import requests
from lxml import html

response=requests.get('http://cse.iitkgp.ac.in/index.php?secret=d2RkOUgybWlNZzJwQXdLc28wNzh6UT09')
parsed_body=html.fromstring(response.content)

for row in parsed_body.xpath('//div[@id="maincontent"]//tr[position() mod 2 = 1]'):
    for column in row.xpath('//td[@class="fcardcls"]/tr[2]/td/font/text()'):        
        print column.strip()    

But it does not print anything. I was struggling quite a bit with XPaths and was initially using the "Copy XPath" feature in Chrome. I followed what was done in the following SO questions/answers, cleaned up my code quite a bit, and got rid of tbody in the XPaths. Still, the code returns nothing.

1. Empty List Returned

2. Python-lxml-xpath problem

1 Answer

First of all, the main content with the desired data is loaded from a different endpoint via an XHR request, so you need to simulate that request in your code.

Here is complete working code that prints each professor's name and list of research areas:

import requests
from lxml import html

response = requests.get('http://cse.iitkgp.ac.in/faculty4.php?_=1450503917634')
parsed_body = html.fromstring(response.content)

for row in parsed_body.xpath('.//td[@class="fcardcls"]'):
    name = row.findtext(".//a[@href]/b")
    name = ' '.join(name.split())  # getting rid of multiple spaces

    research_areas = row.xpath('.//*[. = "Research Areas: "]/following-sibling::text()')[0].split(", ")

    print(name, research_areas)

The idea here is to use the fact that all "professor blocks" are located in td elements with class="fcardcls". For every block, get the name from the bold link text, and get the research areas from the text node that directly follows the bold "Research Areas:" label. An illustration on simplified markup is below.
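If the research-areas XPath looks opaque, here is a minimal, self-contained sketch on made-up markup (it only approximates the real page) showing what findtext() and following-sibling::text() select in one block:

from lxml import html

# Made-up markup approximating a single "professor block"; the real page differs.
doc = html.fromstring(
    '<html><body><table><tr><td class="fcardcls">'
    '<a href="prof.php?id=1"><b>Jane  Doe</b></a><br/>'
    '<b>Research Areas: </b>Machine Learning, Computer Vision'
    '</td></tr></table></body></html>'
)

block = doc.xpath('.//td[@class="fcardcls"]')[0]

# findtext() returns the text of the <b> element inside the profile link
name = ' '.join(block.findtext('.//a[@href]/b').split())

# select the text node that directly follows the bold "Research Areas: " label
research_areas = block.xpath('.//*[. = "Research Areas: "]/following-sibling::text()')[0].split(", ")

print(name, research_areas)  # Jane Doe ['Machine Learning', 'Computer Vision']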

2 Comments

Your code works perfectly and I understand what you've written, thanks. Now I have a couple of questions: 1. How did you find the main content page, i.e. this one? 2. What was the error in the XPath I wrote in my code? It was pointing to the correct element (the research areas) when I checked in Chrome's "Inspect".
@humblenoob Okay, sure. 1. I just used the browser developer tools and inspected which requests were sent during the page load; 2. your code was overall on the right track; one issue is that the inner XPath expression had to start with a dot so that it is evaluated relative to the current row rather than the whole document (see the sketch below). Hope the answer helped.
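To make the leading-dot point concrete, here is a small sketch on made-up markup showing how an absolute // expression ignores the loop variable while a .// expression stays inside it:

from lxml import html

# Made-up two-block document, just to show absolute vs. relative XPath.
doc = html.fromstring(
    '<html><body>'
    '<p class="row">first</p>'
    '<p class="row">second</p>'
    '</body></html>'
)

for row in doc.xpath('//p[@class="row"]'):
    # '//text()' starts from the document root, so every iteration
    # sees the text of *all* rows: ['first', 'second']
    print(row.xpath('//text()'))
    # './/text()' is evaluated relative to the current row:
    # ['first'] on the first pass, ['second'] on the second
    print(row.xpath('.//text()'))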
