Parsing a website with Python

Question

I want to scrape information off this page: https://www.jobsbank.gov.sg/ICMSPortal/portlets/JobBankHandler/SearchDetail.do?id=JOB-2015-0321370

However, I am having trouble parsing it with Python. I am not sure what the issue is, as I am not familiar with HTML. Could it have something to do with the shadow root I see in the HTML? If so, how do I get around it?

import time
import urllib2  # Python 2 module; Python 3 uses urllib.request instead

from lxml import html

url = 'https://www.jobsbank.gov.sg/ICMSPortal/portlets/JobBankHandler/SearchDetail.do?id=JOB-2015-0321370'
hdr = {'User-Agent': 'Mozilla/5.0'}
req = urllib2.Request(url, headers=hdr)
while True:
    try:
        page = urllib2.urlopen(req)
    except urllib2.URLError:
        # Retry on network errors only, rather than swallowing every exception
        print("Connection error was caught, retrying request...")
        time.sleep(5)
    else:
        break
content = page.read()
tree = html.fromstring(content)

# XPath for the job title heading
jobTitle = tree.xpath('//div[@class="jobDes"]/h3/text()')

Thanks.

Are you getting the correct HTML, or is the site blocking you for using a scraper? I tried it, and after a couple of attempts it started returning a page saying "Hello, I am a java script test analytics page".
– Open AI - Opting Out, Commented Sep 4, 2015 at 9:04
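
A quick way to check what the server actually returned is to scan the HTML for the element you want and for any iframes. The following is a minimal sketch in Python 3 using only the standard library; the names PageProbe and probe and the sample string are made up for illustration, and the class name jobDes is taken from the XPath in the question.

```python
from html.parser import HTMLParser


class PageProbe(HTMLParser):
    """Records iframe sources and whether a given CSS class appears."""

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.found_class = False
        self.iframe_sources = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        classes = (attributes.get("class") or "").split()
        if self.wanted_class in classes:
            self.found_class = True
        if tag == "iframe":
            # src is None when the frame has no src attribute,
            # for example when a script fills it in later
            self.iframe_sources.append(attributes.get("src"))


def probe(html_text, wanted_class="jobDes"):
    """Return (class_was_found, list_of_iframe_src_values)."""
    parser = PageProbe(wanted_class)
    parser.feed(html_text)
    return parser.found_class, parser.iframe_sources


# Hypothetical sample standing in for the downloaded page
sample = '<html><body><iframe id="f" src="/detail"></iframe></body></html>'
print(probe(sample))
```

If the class is not found but an iframe is listed, the data you want is probably not in the document you downloaded. If neither shows up, you may be looking at a block page rather than the real one.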

gtlambert · Accepted Answer · 2015-09-04 09:04:52Z

You can't scrape the job description this way because, as you suggest, it is part of an <iframe> tag. The content of the iframe is set using JavaScript just after the page loads, so it is not included in the response to your page = urllib2.urlopen(req) request. To scrape content from an iframe, you will need a browser automation module such as Selenium: http://docs.seleniumhq.org/docs/03_webdriver.jsp
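
As a rough illustration of the Selenium approach, here is a minimal sketch in Python 3 with the Selenium 4 API. It assumes Chrome is installed, that the job details sit in the first iframe on the page, and that the XPath from the question still matches inside that frame; none of this has been verified against the site. The function names fetch_job_title and first_text are made up for this example.

```python
JOB_TITLE_XPATH = '//div[@class="jobDes"]/h3'  # taken from the question


def first_text(elements):
    """Return the stripped text of the first element, or None if the list is empty."""
    return elements[0].text.strip() if elements else None


def fetch_job_title(url, timeout=10):
    """Open the page in a real browser, enter the iframe, and read the title."""
    # Imported here so first_text can be used without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()  # assumption: Chrome is available locally
    try:
        driver.get(url)
        wait = WebDriverWait(driver, timeout)
        # Assumption: the first iframe on the page holds the job details.
        wait.until(
            EC.frame_to_be_available_and_switch_to_it((By.TAG_NAME, "iframe"))
        )
        # Wait until JavaScript has put the heading into the frame.
        wait.until(EC.presence_of_element_located((By.XPATH, JOB_TITLE_XPATH)))
        return first_text(driver.find_elements(By.XPATH, JOB_TITLE_XPATH))
    finally:
        driver.quit()
```

Calling fetch_job_title(url) with the URL from the question would return the heading text if those assumptions hold, and raise a TimeoutException if the frame or the heading never appears.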


2 Comments

Seamus Lam Over a year ago

I was afraid that I need to use Selenium (not familiar with it). But thanks for your answer.

gtlambert Over a year ago

Selenium takes a little bit of learning but is OK once you are up and running. The next problem will then be 'headless browsing', so that you can automate a browser without it actually displaying on your screen.
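
For reference, headless mode can be sketched roughly like this in Python 3 with Selenium 4 and Chrome. The helper names chrome_arguments and open_browser are made up for illustration, and the --headless=new switch assumes a reasonably recent Chrome; older versions take plain --headless instead.

```python
def chrome_arguments(headless=True):
    """Command-line switches to pass to Chrome."""
    arguments = ["--window-size=1280,1024"]
    if headless:
        # Run Chrome without opening a visible window.
        arguments.append("--headless=new")
    return arguments


def open_browser(headless=True):
    """Start Chrome through Selenium, optionally without a window."""
    # Imported here so chrome_arguments works without Selenium installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    for argument in chrome_arguments(headless):
        options.add_argument(argument)
    return webdriver.Chrome(options=options)
```

A typical use would be driver = open_browser(), followed by driver.get(url) and a driver.quit() once you are done. Passing headless=False while debugging lets you watch what the browser is doing.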
