I am trying to scrape a specific element from a website using lxml in Python. My code is below, but it produces no output.

    import requests
    from lxml import html

    webpage = 'http://www.funda.nl/koop/heel-nederland/'
    page = requests.get(webpage)
    tree = html.fromstring(page.content)

    content = '//*[@id="content"]/form/div[2]/div[5]/div/a[8]/text()'
    content = str(tree.xpath(content))
    print(content)

1 Answer

It looks like the website you are trying to scrape does not want to be scraped. It uses various techniques to detect whether a request comes from a legitimate user or from a bot, and it blocks access if it thinks the request comes from a bot. That is why your XPath does not find anything, and it is also a reason to reconsider what you are doing.
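One way to check whether this is what is happening is to look at the HTTP status code together with the XPath result, instead of printing the raw list. The snippet below is only a minimal sketch: `first_match` is a hypothetical helper name, and the message it prints is a guess at the cause, not something the site reports.

```python
def first_match(matches, status_code=None):
    """Return the first XPath text match, stripped, or None if nothing matched.

    `matches` is the list returned by tree.xpath(...); `status_code` is
    optional and only used in the diagnostic message.
    """
    if not matches:
        # An empty list means the selector matched nothing in the HTML we got.
        # Assumption: on this site that often means a bot-detection page was served.
        print("No match (HTTP status: %s); the response may not be the real page"
              % status_code)
        return None
    return matches[0].strip()
```

With the variables from the question, this could be called as `first_match(tree.xpath(content), page.status_code)`. Printing `page.text[:500]` as well shows what the server actually sent back.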

If you decide that you want to continue, then the simplest way of fooling this particular website seems to be adding cookies to your requests.

First, obtain the cookie string using your real browser:

  1. Open a new tab
  2. Open the developer tools
  3. Go to the "Network" tab in the developer tools
  4. If the Network tab is empty, refresh the page
  5. Find the request to heel-nederland/ and click it
  6. In Request Headers, you will find the cookie string. It is quite long and contains many seemingly random characters. Copy it

Then, modify your program to use these cookies:

    import requests
    from lxml import html

    webpage = 'http://www.funda.nl/koop/heel-nederland/'
    # Send the cookie string copied from the browser with the request
    headers = {
        'Cookie': '<string copied from browser>'
    }
    page = requests.get(webpage, headers=headers)
    tree = html.fromstring(page.content)

    selector = '//*[@id="content"]/form/div[2]/div[5]/div/a[8]/text()'
    content = str(tree.xpath(selector))
    print(content)
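As a variation, the copied string can be split into name/value pairs and passed through the `cookies=` parameter of `requests.get` instead of a raw `Cookie` header. This is a rough sketch rather than a tested recipe for this site: `parse_cookie_header` is a hypothetical helper, and the sample string in the usage note is made up.

```python
def parse_cookie_header(cookie_string):
    """Split a browser 'Cookie' header value into a {name: value} dict."""
    cookies = {}
    for pair in cookie_string.split(";"):
        # Split on the first '=' only, since cookie values may contain '='.
        name, sep, value = pair.strip().partition("=")
        if sep and name:
            cookies[name] = value
    return cookies
```

For example, `parse_cookie_header("sessionid=abc123; lang=nl")` gives a dict that could then be used as `requests.get(webpage, cookies=cookie_dict)`. Either way, the cookies will eventually expire and need to be copied from the browser again.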