I'm working on a research project in which we are trying to collect saved image files from Brazil's Hemeroteca database. I've scraped PHP pages with HTML forms before using C/C++, but since this will be a shared script, I need to switch to Python so that everyone in the group can use the tool.

The page which I'm trying to scrape is: http://bndigital.bn.gov.br/hemeroteca-digital/

The page has three form fields that populate in sequence. The first is the newspaper/journal; once it is selected, the available time periods populate; the final field is the search term. I've inspected the page's HTML, and the IDs of these fields are, respectively: 'PeriodicoCmb1_Input', 'PeriodoCmb1_Input', and 'PesquisaTxt1'.

Some Google searches on this topic led me to the Selenium package, and I've put together this sample code to attempt to read the page:

import webbrowser
import requests
import time

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait

print("Begin...")

browser = webdriver.Chrome()
url = "http://bndigital.bn.gov.br/hemeroteca-digital/"
browser.get(url)

print("Waiting to load page... (Delay 3 seconds)")

time.sleep(3)

print("Searching for elements")

journal = browser.find_element_by_id("PeriodicoCmb1_Input")
timeRange = browser.find_element_by_id("PeriodoCmb1_Input")
searchTerm = browser.find_element_by_id("PesquisaTxt1")

print(journal)

print("Set fields, delay 3 seconds between input")

search_journal = "Relatorios dos Presidentes dos Estados Brasileiros (BA)"
search_timeRange = "1890 - 1899"
search_text = "Milho"

journal.send_keys(search_journal)
time.sleep(3)
timeRange.send_keys(search_timeRange)
time.sleep(3)
searchTerm.send_keys(search_text)

print("Perform search")

submitButton = browser.find_element_by_id("PesquisarBtn1_input")
submitButton.click()

The script runs to the print(journal) statement, where an error is thrown saying the element cannot be found.

Can anyone take a quick sweep of the page in question and make sure I've got the general premise of this script in line correctly, or point me towards some examples to get me running on this problem?

Thanks!

  • Are you sure about the IDs? I can't find PeriodicoCmb1_Input, PeriodoCmb1_Input, or PesquisaTxt1 on the page. Commented Mar 28, 2018 at 21:18

1 Answer

The DOM elements you are trying to find are located inside an iframe, so before calling find_element_by_id you need to switch the driver to the iframe's context.

Here is how to switch to the iframe context:

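# add your code

# Locate the first iframe on the page and switch the driver's context into it.
# switch_to.frame() returns nothing, so there is no value to keep.
frame_ref = browser.find_elements_by_tag_name("iframe")[0]
browser.switch_to.frame(frame_ref)

# Element lookups now resolve inside the iframe.
journal = browser.find_element_by_id("PeriodicoCmb1_Input")
timeRange = browser.find_element_by_id("PeriodoCmb1_Input")
searchTerm = browser.find_element_by_id("PesquisaTxt1")

# add your code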

Here is a link describing switching to iframe context.

1 Comment

Worked like a charm. Some more modifications are needed to capture the update events on the JS elements, but I think I should be able to figure it out from here. Thanks!
