
I've been working on a research project that aims to collect a list of reference-article links from the Brazilian Hemeroteca Digital. The desired page reference (e.g. http://memoria.bn.br/DocReader/720887x/839) needs to be reconstructed from two hidden elements on the following page: http://memoria.bn.br/DocReader/docreader.aspx?bib=720887x&pasta=ano%20189&pesq=Milho. I asked a question a few weeks back that was answered, and I got that part running well, but now I've hit a new snag and I'm not sure how to get around it.

The problem is that after the first form is filled in, the site redirects to a second, JavaScript/AJAX-driven page, and I need to step through all of the matches there, which is done by clicking a button at the top of the page. When I click the next-page button, the elements on the page update in place, which leads to stale element references. I've tried a few approaches to detect when this "stale" state occurs, as a signal that the page has changed, but without much luck. Here is the code I've implemented:

import urllib
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import Select
import time

saveDir = "C:/tmp"

print("Opening Page...")

browser = webdriver.Chrome()
url = "http://bndigital.bn.gov.br/hemeroteca-digital/"
browser.get(url)

print("Searching for elements")

fLink = ""
fails = 0

frame_ref = browser.find_elements_by_tag_name("iframe")[0]
browser.switch_to.frame(frame_ref)
journal = browser.find_element_by_id("PeriodicoCmb1_Input")

search_journal = "Relatorios dos Presidentes dos Estados Brasileiros (BA)"
search_timeRange = "1890 - 1899"
search_text = "Milho"

xpath_form = "//input[@name=\'PesquisarBtn1\']"
xpath_journal = "//li[text()=\'"+search_journal+"\']"
xpath_timeRange = "//input[@name=\'PeriodoCmb1\' and not(@disabled)]"
xpath_timeSelect = "//li[text()=\'"+search_timeRange+"\']"
xpath_searchTerm = "//input[@name=\'PesquisaTxt1\']"

print("Locating Journal/Periodical")
journal.click()
dropDownJournal = WebDriverWait(browser, 60).until(EC.presence_of_element_located((By.XPATH, xpath_journal)))
dropDownJournal.click()
print("Waiting for Time Selection")
try:
    timeRange = WebDriverWait(browser, 20).until(EC.presence_of_element_located((By.XPATH, xpath_timeRange)))
    timeRange.click()
    time.sleep(1)
    print("Locating Time Range")    
    dropDownTime = WebDriverWait(browser, 20).until(EC.presence_of_element_located((By.XPATH, xpath_timeSelect)))
    dropDownTime.click()
    time.sleep(1)
except Exception:
    print("Failed...")
print("Adding Search Term")

searchTerm = WebDriverWait(browser, 20).until(EC.presence_of_element_located((By.XPATH, xpath_searchTerm)))
searchTerm.clear()
searchTerm.send_keys(search_text)
time.sleep(5)

print("Perform search")

submitButton = WebDriverWait(browser, 20).until(EC.presence_of_element_located((By.XPATH, xpath_form)))
submitButton.click()

# Wait for the second page to load, pull what we need from it.
download_list = []

browser.switch_to.window(browser.window_handles[-1])
print("Waiting for next page to load...")

matches = WebDriverWait(browser, 20).until(EC.presence_of_element_located((By.XPATH, "//span[@id=\'OcorNroLbl\']")))
print("Next page ready, found match element... counting")
countText = matches.text
countTotal = int(countText[countText.find("/")+1:])
print("A total of " + str(countTotal) + " matches have been found, standing by for page load.")
for i in range(1, countTotal+2):               
    print("Waiting for page " + str(i-1) + " to load...")
    while(fLink in download_list):
        try:
            jIDElement = browser.find_element_by_xpath("//input[@name=\'HiddenBibAlias\']")
            jPageElement = browser.find_element_by_xpath("//input[@name=\'hPagFis\']")
            fLink = "http://memoria.bn.br/DocReader/" + jIDElement.get_attribute('value') + "/" + jPageElement.get_attribute('value') + "&pesq=" + search_text         
        except:
            fails += 1
            time.sleep(1)
            if(fails == 10):
                print("Locked on a page, attempting to push to next.")
                nextPageButton = WebDriverWait(browser, 5).until(EC.presence_of_element_located((By.XPATH, "//input[@id=\'OcorPosBtn\']")))
                nextPageButton.click()                    
            #raise
        while(fLink == ""):
            jIDElement = browser.find_element_by_xpath("//input[@name=\'HiddenBibAlias\']")
            jPageElement = browser.find_element_by_xpath("//input[@name=\'hPagFis\']")
            fLink = "http://memoria.bn.br/DocReader/" + jIDElement.get_attribute('value') + "/" + jPageElement.get_attribute('value') + "&pesq=" + search_text                     
    fails = 0
    print("Link obtained: " + fLink)
    download_list.append(fLink)

    if(i != countTotal):
        print("Moving to next page...")
        nextPageButton = WebDriverWait(browser, 5).until(EC.presence_of_element_located((By.XPATH, "//input[@id=\'OcorPosBtn\']")))
        nextPageButton.click()

There are two bugs I'm trying to solve in this block. First, the very first page is always skipped in the loop (i.e. fLink stays ""), even though there is a test in there for it; I'm not sure why this occurs. The other bug is that the code hangs on specific pages, seemingly at random, and the only way out is to interrupt execution.

This block has been modified a few times, so I know it's not the most elegant of solutions, but I'm starting to run out of time.

  • I'm trying to follow the code you provided vs the link you posted and they don't seem to be the same page. It would be better if you explained, in words, the scenario your code represents, and posted a link to the page your code starts on, so we can follow along. Commented Apr 17, 2018 at 2:26
  • Sure, the code is looking to scrape all instances of a hit for a specific journal / time period. The main page is here. I already have the code that fills in the form there working, and it links to the journal entries (the first of which is the link provided above). The goal of the program is to grab two hidden form elements from each hit (names are: HiddenBibAlias and hPagFis). The flow is basically: load into the journal pages, grab the two hidden values and save them, then go to the next page, looping through until all are done. Commented Apr 17, 2018 at 2:38
  • You are explaining your implementation of the program rather than the goal. What are you actually trying to do? I doubt the goal of the program is to find hidden elements. Are you trying to capture URLs for the various pages of a document, or download the images of pages, or something else? Your code is not a minimal reproducible example. You have variables that aren't declared. You are starting at a page that you have not told us about. You are getting hidden elements that contain data that is readily available in the URL. I'm still confused as to what you are trying to do. Commented Apr 17, 2018 at 3:14
  • I've updated the original post to contain the full code used. The goal of the program is to run through the database (using the three defined variables for the journal, time, and search term) and capture the links of the individual results. The way the website is built, however, simply copying the original link will not suffice; you need to capture the two hidden elements from each hit in the search in order to reconstruct a result link, which I want to save to an output file (.txt). I apologize if it wasn't clear; the above URL was the "desired result", built from the hidden fields. Commented Apr 17, 2018 at 5:08

1 Answer


After taking a day off from this to think about it (and get some more sleep), I was able to figure out what was going on. The above code has three big faults. The first is that it does not distinguish the StaleElementReferenceException from the NoSuchElementException, both of which can occur while the page is shifting. Second, the loop condition iteratively checked that a page wasn't already in the list, so on the very first run the loop body never executed and the blank fLink was appended directly (a do-while would have been the right shape there, but I made further modifications instead). Finally, I made the silly error of only checking whether the first hidden element was changing, when in fact that is the journal ID, which is essentially constant across all pages.
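For reference, the do-while shape mentioned above: Python has no do-while statement, but the "run the body once, then test" pattern can be emulated with while True plus a break. A minimal sketch; the names collect_until_repeat and fetch_link are illustrative, not from the original code:

```python
def collect_until_repeat(fetch_link):
    """Emulated do-while: the body runs at least once before the test."""
    download_list = []
    while True:
        link = fetch_link()        # body: always executes, even on the first pass
        if link in download_list:  # condition: tested *after* the body
            break
        download_list.append(link)
    return download_list
```

With this shape a blank first pass can't slip through, because the link is fetched before the membership test ever runs.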

The revisions began with an adaptation of code from another SO answer to implement a "hold" condition until either of the hidden elements changed:

from selenium.common.exceptions import StaleElementReferenceException
from selenium.common.exceptions import NoSuchElementException
def hold_until_element_changed(driver, element1_xpath, element2_xpath, old_element1_text, old_element2_text):
    while True:
        try:
            element1 = driver.find_element_by_xpath(element1_xpath)
            element2 = driver.find_element_by_xpath(element2_xpath)
            if (element1.get_attribute('value') != old_element1_text) or (element2.get_attribute('value') != old_element2_text):
                break
        except StaleElementReferenceException:
            break
        except NoSuchElementException:
            return False
        time.sleep(1)
    return True    
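As an aside, the same polling idea can be factored into a generic helper that uses only the standard library and isn't tied to Selenium at all. This is a sketch; poll_until is an illustrative name, not part of the original code:

```python
import time

def poll_until(predicate, timeout=20.0, interval=1.0):
    """Call predicate() repeatedly until it returns True or the timeout expires.

    Returns True if the predicate succeeded, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(interval)
    return False
```

The change-detection logic (including the exception handling) could then live in one small closure passed to poll_until, rather than being woven through the loop.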

I then modified the original looping condition, going back to the original for-loop counter I had created, without an internal loop; instead, the loop calls the function above to hold until the page has flipped, and voila, it worked like a charm. (Note: I also raised the timeout on the next-page button, since the short timeout was what caused the locking condition.)

for i in range(1, countTotal+1):               
    print("Waiting for page " + str(i) + " to load...")
    bibxpath = "//input[@name=\'HiddenBibAlias\']"
    pagexpath = "//input[@name=\'hPagFis\']"
    jIDElement = WebDriverWait(browser, 20).until(EC.presence_of_element_located((By.XPATH, bibxpath)))
    jPageElement = WebDriverWait(browser, 20).until(EC.presence_of_element_located((By.XPATH, pagexpath)))
    jidtext = jIDElement.get_attribute('value')
    jpagetext = jPageElement.get_attribute('value')
    fLink = "http://memoria.bn.br/DocReader/" + jidtext + "/" + jpagetext + "&pesq=" + search_text         
    print("Link obtained: " + fLink)
    download_list.append(fLink)

    if(i != countTotal):
        print("Moving to next page...")
        nextPageButton = WebDriverWait(browser, 20).until(EC.presence_of_element_located((By.XPATH, "//input[@id=\'OcorPosBtn\']")))
        nextPageButton.click()
        # Wait for next page to be ready
        change = hold_until_element_changed(browser, bibxpath, pagexpath, jidtext, jpagetext)
        if not change:
            print("Something went wrong.")

All in all, a good exercise in thought and some helpful links for me to consider when posting future questions. Thanks!
