1

I am trying to scrape a table that is generated by JavaScript, but I am struggling. My code so far is:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Chrome()
driver.get("https://af.ktnlandscapes.com/")

# get table -- first wait for the table rows to load
WebDriverWait(driver, 10).until(EC.presence_of_all_elements_located((By.XPATH, "//*[@id='list-view']/tbody/tr")))
table = driver.find_element(By.XPATH, "//*[@id='list-view']")

# get rows
rows = table.find_elements(By.XPATH, "tbody/tr")

# iterate over rows and print each row's "listing" attribute
for row in rows:
    print(row.get_attribute("listing"))

I want to scrape the "listing=" numbers from the table rows. I am not sure how to extract them, and I also cannot work out how to force the page to load the rest of the rows, since they only load once you scroll down the table a bit.

I am interested in these listing numbers

6
  • "There are 279 unique listings that match your search" maybe you can get this number? Commented Jan 21, 2020 at 10:28
  • I want the actual listing numbers within the html though Commented Jan 21, 2020 at 10:33
  • for row in rows: print( row.get_attribute("listing") ) ? Commented Jan 21, 2020 at 10:45
  • This worked fantastically! My problem is now that some of the rows only load once you have scrolled the table a bit... Commented Jan 21, 2020 at 10:47
  • For scrolling, you may need to find JavaScript code that can be run with driver.execute_script("javascript_code") Commented Jan 21, 2020 at 10:49

2 Answers

5

Try the code below:

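import time

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Chrome()
driver.get("https://af.ktnlandscapes.com/")

# get table -- first wait for the table rows to load
WebDriverWait(driver, 10).until(EC.presence_of_all_elements_located((By.XPATH, "//*[@id='list-view']/tbody/tr")))
table = driver.find_element(By.XPATH, "//*[@id='list-view']")

get_number = 0
while True:
    count = get_number  # row count from the previous pass
    rows = table.find_elements(By.XPATH, "tbody/tr[@class='list-view-listing']")
    driver.execute_script("arguments[0].scrollIntoView();", rows[-1])  # scroll to last row
    get_number = len(rows)
    print(get_number)
    time.sleep(1)  # give the next batch of rows time to load
    if get_number == count:  # no new rows since the last pass, so stop
        break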

Output:

20
40
60
80
100
120
140
160
180
200
220
240
260
280
300
320
339
339

Querying the rows in the web console confirms that the table actually contains 339 rows.
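The loop above only counts rows; the question asks for the listing numbers themselves. Here is a minimal sketch of how the same scroll-until-stable idea could return them. The function name `collect_listings` and the `max_rounds` safety cap are hypothetical additions, and the XPath assumes the same table id and row class used above. The locator string "xpath" is the value of Selenium's By.XPATH, so the function does not need to import Selenium itself.

```python
import time


def collect_listings(driver, pause=1.0, max_rounds=100):
    """Scroll the lazy-loading table until no new rows appear, then
    return the "listing" attribute of every loaded row."""
    # Assumed locator: same table id and row class as in the answer above.
    row_xpath = "//*[@id='list-view']/tbody/tr[@class='list-view-listing']"
    previous_count = -1
    rows = []
    for _ in range(max_rounds):  # hypothetical cap to avoid looping forever
        # "xpath" is the string value of selenium's By.XPATH strategy.
        rows = driver.find_elements("xpath", row_xpath)
        if not rows or len(rows) == previous_count:
            break  # nothing loaded, or the last scroll added no rows
        previous_count = len(rows)
        # Scrolling the last row into view triggers loading the next batch.
        driver.execute_script("arguments[0].scrollIntoView();", rows[-1])
        time.sleep(pause)  # crude wait for the next batch to be appended
    return [row.get_attribute("listing") for row in rows]
```

With a live browser this would be called as `collect_listings(driver)` after `driver.get(...)` and the initial wait. A fixed sleep is the simplest option but not the most robust; an explicit wait on the row count would likely be more reliable.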


2

This is probably simpler to do with requests. If you inspect the page's network traffic in Chrome or Firefox, you will see that it sends GET requests for more data as you scroll the list area. The endpoint is: /list-view-load.php?landscape_id=31&landscape_nid=33192&region=All&category=All&subcategory=All&search=&custom1=&custom2=&custom3=&custom4=&custom5=&offset=20, with the offset increasing by 20 for each request.

You can imitate this via:

import requests
from lxml import html

sess = requests.Session()
url = ('https://af.ktnlandscapes.com/sites/all/themes/landscape_tools/functions'
       '/list-view-load.php?landscape_id=31&landscape_nid=33192&region=All&'
       'category=All&subcategory=All&search=&custom1=&custom2=&custom3=&'
       'custom4=&custom5=&offset={offset}')

gets = []
for i in range(50):
    data = sess.get(url.format(offset=20*i)).json().get('data')
    if not data:
        break
    gets.append(data)
    print(f'\rfinished request {i}', end='')
else:
    print('There is more data!! Increase the range.')

listings = []
for g in gets:
    h = html.fromstring(g)
    listings.extend(h.xpath('tr/@listing'))

print('Number of listings:', len(listings))
# prints: Number of listings: 339

listings
# returns
['91323', '91528', '91282', '91529', '91572', '91356', '91400', '91445',
 '91373', '91375', '91488', '91283', '91294', '91324', '91423', '91325',
 '91475', '91415', '91382', '91530', '91573', '91295', '91326', '91424',
 ...
 '91568', '91592', '91613', '91569', '91593', '91594', '91570', '91352',
 '91414', '91486', '91353', '91304', '91311', '91354', '91399', '91602',
 '91571', '91610', '103911']
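As a variation on the paging loop above, the query string could be built from a dict and the pages fetched until one comes back empty, which avoids guessing an upper bound like range(50). This is only a sketch: `page_url`, `iter_pages`, the `fetch` callable and the `max_pages` cap are hypothetical names, and the parameter values are copied from the request quoted in this answer.

```python
from urllib.parse import urlencode

BASE_URL = ('https://af.ktnlandscapes.com/sites/all/themes/landscape_tools'
            '/functions/list-view-load.php')

# Query parameters copied from the request quoted above (offset added per page).
PARAMS = {
    'landscape_id': 31, 'landscape_nid': 33192, 'region': 'All',
    'category': 'All', 'subcategory': 'All', 'search': '',
    'custom1': '', 'custom2': '', 'custom3': '', 'custom4': '', 'custom5': '',
}


def page_url(offset):
    """Return the list-view URL for the page starting at `offset`."""
    return BASE_URL + '?' + urlencode({**PARAMS, 'offset': offset})


def iter_pages(fetch, page_size=20, max_pages=1000):
    """Yield each page's data until `fetch` returns an empty result.

    `fetch` is any callable that takes a URL and returns that page's data.
    `max_pages` is a hypothetical safety cap.
    """
    for i in range(max_pages):
        data = fetch(page_url(i * page_size))
        if not data:
            return  # an empty page means there is nothing left to load
        yield data
```

With the session from the code above, this might be used as `gets = list(iter_pages(lambda u: sess.get(u).json().get('data')))`, after which the lxml parsing step stays the same.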

4 Comments

I'm not sure this is quite right, shouldn't there only be 339 results in the list?
The output is abbreviated for space. That is what the ... represents.
Sorry I'm a little confused, could you clarify the get 'url' I should be using?
Sorry, I dropped its assignment. Updated now.
