I'm trying to download files from a VA dataset website with a Python scraper, but I can't figure out how to parse the JavaScript in the page's HTML that appears to contain the files. This is the source code for the page (view-source:https://www.data.va.gov/dataset/Air-Force-Veterans-2017-Living-Only/9u8y-zaby). I'm trying to download the ".xlsx" files, which (from searching the source with Command+F on my Mac) I think are in JavaScript objects. I've looked around this site and others but haven't been able to figure out how to scrape links from within JavaScript. How should I go about doing this? Any help would be greatly appreciated.
1 Answer
That page is generated dynamically, so you can use Selenium to find the link and download the file.
Here is working code that uses wget, selenium, and webdriver_manager.
It looks up the download link and saves the .xlsx file to a user-defined directory.
import os
import wget
from selenium import webdriver
from selenium.webdriver import ChromeOptions
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from webdriver_manager.chrome import ChromeDriverManager

options = ChromeOptions()
# optional flags to try out, e.g. for headless or containerized runs
# options.binary_location = '/opt/headless-chromium'
# options.add_argument("--headless")
# options.add_argument("--disable-gpu")
# options.add_argument("--no-sandbox")
# options.add_argument('--disable-dev-shm-usage')
# options.add_argument('--disable-gpu-sandbox')
# options.add_argument("--single-process")
options.add_argument(
    "user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.54 Safari/537.36")
options.add_argument('--disable-blink-features=AutomationControlled')
options.add_experimental_option("useAutomationExtension", False)
options.add_experimental_option("excludeSwitches", ["enable-automation"])

s = Service(ChromeDriverManager().install())
# s = Service(GeckoDriverManager().install())  # Firefox alternative
driver = webdriver.Chrome(service=s, options=options)
try:
    driver.get('https://www.data.va.gov/dataset/Air-Force-Veterans-2017-Living-Only/9u8y-zaby')
    # wait up to 10 seconds for the page's JavaScript to render the download link
    link = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located(
            (By.XPATH, '//*[@id="app"]/div/div[2]/section/div/div/div[2]/a')))
    # get the link
    download_link = link.get_attribute('href')
finally:
    # quit() closes every window and ends the driver session
    driver.quit()

# download the file
output_directory = 'data'  # the file is saved into this directory
os.makedirs(output_directory, exist_ok=True)  # wget does not create the directory
filename = wget.download(download_link, out=output_directory)
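If the wget package is not available, the download step could probably be done with the standard library alone. The following is a minimal sketch and not part of the original answer: save_url, its parameters, and the fallback file name are hypothetical. It assumes the link can be fetched without the browser's cookies, and that the last path segment of the URL is a usable file name, which may not hold for this site.

```python
import os
import shutil
import urllib.request
from urllib.parse import urlparse, unquote


def save_url(url, output_directory="data", default_name="download.xlsx"):
    """Fetch url and write it into output_directory; return the saved path."""
    os.makedirs(output_directory, exist_ok=True)
    # Use the last path segment of the URL as the file name, if there is one.
    name = os.path.basename(unquote(urlparse(url).path)) or default_name
    target = os.path.join(output_directory, name)
    # Stream the response to disk instead of loading it all into memory.
    with urllib.request.urlopen(url) as response, open(target, "wb") as fh:
        shutil.copyfileobj(response, fh)
    return target
```

With the script above, it would be called as save_url(download_link, output_directory) in place of the wget.download call.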
.xlsx means that it's an Excel file. The files themselves are probably (almost certainly) separate URLs.
The download link is an <a> element; the URL for the download is "data.va.gov/download/9u8y-zaby/…".
Embedding .xlsx source in JavaScript on a page would be fairly insane, as .xlsx files are generally very large.
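To illustrate the point that the files live at separate URLs, here is a minimal sketch of pulling such links out of page source text, whether they appear in an <a> element or inside a JavaScript object. The function name find_download_links and the "/download/" marker are assumptions made for illustration, and the sample host in the usage note is made up. Because the page is generated dynamically, the text passed in would likely need to be Selenium's driver.page_source rather than the raw HTML from a plain HTTP request.

```python
import re

# Absolute http(s) URLs, stopping at whitespace, quotes, angle brackets or backslashes.
URL_PATTERN = re.compile(r"""https?://[^\s"'<>\\]+""")


def find_download_links(page_source, marker="/download/"):
    """Return unique URLs in page_source that contain marker, in order of appearance."""
    # JSON embedded in <script> blocks often escapes slashes as "\/".
    text = page_source.replace("\\/", "/")
    found = []
    for url in URL_PATTERN.findall(text):
        if marker in url and url not in found:
            found.append(url)
    return found
```

For example, find_download_links('<a href="https://example.invalid/download/abcd/sample.xlsx">x</a>') returns a one-element list containing that URL.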