For educational purposes, I am trying to scrape this page using lxml and requests in Python.

Specifically, I just want to print the research areas of all the professors on the page. This is what I have done so far:

import requests
from lxml import html

response=requests.get('http://cse.iitkgp.ac.in/index.php?secret=d2RkOUgybWlNZzJwQXdLc28wNzh6UT09')
parsed_body=html.fromstring(response.content)

for row in parsed_body.xpath('//div[@id="maincontent"]//tr[position() mod 2 = 1]'):
    for column in row.xpath('//td[@class="fcardcls"]/tr[2]/td/font/text()'):        
        print column.strip()    

But it does not print anything. I was struggling quite a bit with XPaths and was initially using the "Copy XPath" feature in Chrome. I followed what was done in the following SO questions/answers, cleaned up my code quite a bit, and got rid of tbody in the XPaths. Still, the code returns nothing.

1. Empty List Returned

2. Python-lxml-xpath problem

1 Answer

First of all, the main content with the desired data is loaded from a different endpoint via an XHR request, so you need to simulate that request in your code.

Here is complete working code that prints each professor's name and list of research areas:

import requests
from lxml import html

response = requests.get('http://cse.iitkgp.ac.in/faculty4.php?_=1450503917634')
parsed_body = html.fromstring(response.content)

for row in parsed_body.xpath('.//td[@class="fcardcls"]'):
    name = row.findtext(".//a[@href]/b")
    name = ' '.join(name.split())  # getting rid of multiple spaces

    research_areas = row.xpath('.//*[. = "Research Areas: "]/following-sibling::text()')[0].split(", ")

    print(name, research_areas)

The idea here is to use the fact that all "professor blocks" are located in td elements with class="fcardcls". For every block, get the name from the bold link text, and get the research areas from the text node that directly follows the bold "Research Areas:" label. An illustration on simplified markup is below.
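If the research-areas XPath looks opaque, here is a minimal, self-contained sketch on made-up markup (it only approximates the real page) showing what findtext() and following-sibling::text() select in one block:

from lxml import html

# Made-up markup approximating a single "professor block"; the real page differs.
doc = html.fromstring(
    '<html><body><table><tr><td class="fcardcls">'
    '<a href="prof.php?id=1"><b>Jane  Doe</b></a><br/>'
    '<b>Research Areas: </b>Machine Learning, Computer Vision'
    '</td></tr></table></body></html>'
)

block = doc.xpath('.//td[@class="fcardcls"]')[0]

# findtext() returns the text of the <b> element inside the profile link
name = ' '.join(block.findtext('.//a[@href]/b').split())

# select the text node that directly follows the bold "Research Areas: " label
research_areas = block.xpath('.//*[. = "Research Areas: "]/following-sibling::text()')[0].split(", ")

print(name, research_areas)  # Jane Doe ['Machine Learning', 'Computer Vision']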

2 Comments

Your code works perfectly and I understand what you've written, thanks. Now I have a couple of questions: 1. How did you find the main content page, i.e. this one? 2. What was the error in the XPath I wrote in my code? It was pointing to the correct element (the research areas) when I checked in Chrome's "Inspect".
@humblenoob Okay, sure. 1. I just used the browser developer tools and inspected which requests were sent during the page load; 2. your code was overall on the right track; one issue is that the inner XPath expression had to start with a dot so that it is evaluated relative to the current row rather than the whole document (see the sketch below). Hope the answer helped.
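To make the leading-dot point concrete, here is a small sketch on made-up markup showing how an absolute // expression ignores the loop variable while a .// expression stays inside it:

from lxml import html

# Made-up two-block document, just to show absolute vs. relative XPath.
doc = html.fromstring(
    '<html><body>'
    '<p class="row">first</p>'
    '<p class="row">second</p>'
    '</body></html>'
)

for row in doc.xpath('//p[@class="row"]'):
    # '//text()' starts from the document root, so every iteration
    # sees the text of *all* rows: ['first', 'second']
    print(row.xpath('//text()'))
    # './/text()' is evaluated relative to the current row:
    # ['first'] on the first pass, ['second'] on the second
    print(row.xpath('.//text()'))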
