I'm currently working on a research project in which we are trying to collect saved image files from Brazil's Hemeroteca database. I've done web scraping on PHP pages before using C/C++ with HTML forms, but as this is a shared script, I need to switch to python such that everyone in the group can use this tool.
The page which I'm trying to scrape is: http://bndigital.bn.gov.br/hemeroteca-digital/
There are three forms which populate, the first being the newspaper/journal. Upon selecting this, the available times populate, and the final field is the search term. I've inspected the HTML page here and the three IDs of these are respectively: 'PeriodicoCmb1_Input', 'PeriodoCmb1_Input', and 'PesquisaTxt1'.
Some google searches on this topic led me to the Selenium package, and I've put together this sample code to attempt to read the page:
import webbrowser
import requests
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium import webdriver
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
import time
print("Begin...")
browser = webdriver.Chrome()
url = "http://bndigital.bn.gov.br/hemeroteca-digital/"
browser.get(url)
print("Waiting to load page... (Delay 3 seconds)")
time.sleep(3)
print("Searching for elements")
journal = browser.find_element_by_id("PeriodicoCmb1_Input")
timeRange = browser.find_element_by_id("PeriodoCmb1_Input")
searchTerm = browser.find_element_by_id("PesquisaTxt1")
print(journal)
print("Set fields, delay 3 seconds between input")
search_journal = "Relatorios dos Presidentes dos Estados Brasileiros (BA)"
search_timeRange = "1890 - 1899"
search_text = "Milho"
journal.send_keys(search_journal)
time.sleep(3)
timeRange.send_keys(search_timeRange)
time.sleep(3)
searchTerm.send_keys(search_text)
print("Perform search")
submitButton = button.find_element_by_id("PesquisarBtn1_input")
submitButton.click()
The script runs to the print(journal) statement, where an error is thrown saying the element cannot be found.
Can anyone take a quick sweep of the page in question and make sure I've got the general premise of this script in line correctly, or point me towards some examples to get me running on this problem?
Thanks!
PeriodicoCmb1_Input,PeriodoCmb1_InputandPesquisaTxt1on the page?