
I'm trying to scrape real estate data off of this website: example. As you can see, the relevant content is placed into <article> tags.

I'm running selenium with phantomjs:

driver = webdriver.PhantomJS(executable_path=PJSpath)

Then I generate the URL in Python. All search parameters are part of the link, so I can run the search I want from within the program without having to fill out the form.
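For illustration, a minimal sketch of how such a link could be assembled with urllib.parse.urlencode; the parameter names mirror the query string of the seloger.com URL used in the answer below, and the concrete values are placeholders:

from urllib.parse import urlencode

# hypothetical search parameters; the keys mirror the query string of the
# seloger.com URL used in the answer below, the values are placeholders
params = {
    "cp": 40250,                  # postal code
    "org": "advanced_search",
    "idtt": 2,                    # transaction type
    "pxmin": 50000,               # price range
    "pxmax": 200000,
    "surfacemin": 20,             # surface range
    "surfacemax": 100,
    "idtypebien": [2, 1, 11],     # property types (repeated parameter)
}

# doseq=True expands the list into repeated idtypebien=... parameters
engine_link = "http://www.seloger.com/list.htm?" + urlencode(params, doseq=True)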

Before calling

driver.get(engine_link)

I copy engine_link to the clipboard, and the link opens fine in Chrome. Next I wait for all possible redirects to happen:

import time

from selenium.common.exceptions import StaleElementReferenceException


def wait_for_redirect(wdriver):
    # grab a reference to the current <html> element
    elem = wdriver.find_element_by_tag_name("html")
    count = 0
    while True:
        count += 1
        if count > 5:
            print("Waited for redirect for 5 seconds!")
            return
        time.sleep(1)
        try:
            # re-locate <html>; a StaleElementReferenceException is taken
            # to mean the page has been replaced by a redirect
            elem = wdriver.find_element_by_tag_name("html")
        except StaleElementReferenceException:
            return
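As an aside, the same idea can be expressed with Selenium's built-in explicit waits; a minimal sketch, assuming the redirect really does replace the current document:

from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait


def wait_for_redirect(wdriver, timeout=5):
    # remember the <html> element of the current page...
    old_html = wdriver.find_element_by_tag_name("html")
    try:
        # ...and wait until that reference goes stale,
        # i.e. the document was replaced by a redirect
        WebDriverWait(wdriver, timeout).until(EC.staleness_of(old_html))
    except TimeoutException:
        print("Waited for redirect for %d seconds!" % timeout)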

Now at last I want to iterate over all <article> tags on the current page:

for article in driver.find_elements_by_tag_name("article"):

But this loop never yields anything: the program doesn't find any <article> tags, even though I've also tried XPath and CSS selectors. Moreover, the articles are enclosed in a <section> tag, which can't be found either.

Is there a problem with these specific types of tags in Selenium, or am I missing something JS-related here? At the bottom of the page there are JavaScript templates whose naming suggests that they generate the search results.

Any help appreciated!

1 Answer


Pretend not to be PhantomJS by setting a regular browser user agent, and add an explicit wait (worked for me):

from selenium import webdriver
from selenium.webdriver import DesiredCapabilities
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# set a custom user-agent
user_agent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.57 Safari/537.36"
dcap = dict(DesiredCapabilities.PHANTOMJS)
dcap["phantomjs.page.settings.userAgent"] = user_agent

driver = webdriver.PhantomJS(desired_capabilities=dcap)
driver.get("http://www.seloger.com/list.htm?cp=40250&org=advanced_search&idtt=2&pxmin=50000&pxmax=200000&surfacemin=20&surfacemax=100&idtypebien=2&idtypebien=1&idtypebien=11")

# wait for articles to be present
wait = WebDriverWait(driver, 10)
wait.until(EC.presence_of_element_located((By.TAG_NAME, "article")))

# get articles
for article in driver.find_elements_by_tag_name("article"):
    print(article.text)
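Once the articles render, each listing's details can be pulled out the same way; a rough sketch, where the CSS class names in the selectors are hypothetical and would have to be checked against the actual page markup:

from selenium.common.exceptions import NoSuchElementException

# the ".price" and ".surface" class names below are placeholders, not the
# real markup of the listing page
for article in driver.find_elements_by_tag_name("article"):
    try:
        price = article.find_element_by_css_selector(".price").text
        surface = article.find_element_by_css_selector(".surface").text
        print(price, surface)
    except NoSuchElementException:
        continue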
