Python - scraping files that are in JavaScript objects

Question

I'm trying to download files from a VA dataset website with a Python scraper but I'm having trouble figuring out how to parse the JavaScript in the HTML website that appears to contain the files. This is source code for the website (view-source:https://www.data.va.gov/dataset/Air-Force-Veterans-2017-Living-Only/9u8y-zaby). I'm trying to download the ".xlsx" files, which (by just using command+F on my Mac) I think are in JavaScript objects. I've looked around this site and others but haven't been able to figure out how to scrape links from within JavaScript. How should I go about doing this? Any help would be greatly appreciated.

@Pointy I realize that, and it seems they're not contained in the usual HTML but rather in JavaScript objects.
– Aaron, Commented May 17, 2022 at 18:04
No, the JavaScript is probably for arranging the download of the .xlsx files. The files themselves are probably (almost certainly) separate URLs.
– Pointy, Commented May 17, 2022 at 18:37
In fact you can clearly see it if you inspect that <a> element; the URL for the download is "data.va.gov/download/9u8y-zaby/…"
– Pointy, Commented May 17, 2022 at 18:39
Storing .xlsx source in JavaScript on a page would be fairly insane, as .xlsx files are generally very large.
– Pointy, Commented May 17, 2022 at 18:41
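
To make the point in the comments concrete: even when a link only shows up inside a <script> block, it is still an ordinary URL string, so it can often be pulled out of the raw page source with a regular expression. The sketch below assumes the URLs appear as quoted strings, possibly with JSON-escaped slashes (`\/`). The helper name `extract_xlsx_links`, the sample markup, the `window.initialState` variable, and the `example.invalid` URLs are all made up for illustration and are not taken from data.va.gov.

```python
import re
from urllib.parse import urljoin

# Matches a quoted string ending in .xlsx, optionally followed by a query string.
XLSX_RE = re.compile(r'''["']([^"'\s<>]+?\.xlsx(?:\?[^"'\s<>]*)?)["']''', re.IGNORECASE)


def extract_xlsx_links(page_source, base_url):
    """Return absolute .xlsx URLs found in quoted strings, in order, without duplicates."""
    links = []
    for raw in XLSX_RE.findall(page_source):
        # Undo JSON-escaped slashes, then resolve relative paths against the page URL.
        url = urljoin(base_url, raw.replace("\\/", "/"))
        if url not in links:
            links.append(url)
    return links


# Hypothetical page source, only to show the shape of the input.
sample = r'''
<script>
  window.initialState = {"attachments": [
    {"name": "Report", "href": "https:\/\/example.invalid\/files\/report.xlsx"},
    {"name": "Copy", "href": "/files/copy.xlsx?download=1"}
  ]};
</script>
<a href="https://example.invalid/files/report.xlsx">Download</a>
'''

print(extract_xlsx_links(sample, "https://example.invalid/dataset/page"))
```

This only helps when the link is present in the HTML the server sends. If the page builds the link in the browser after loading, the raw source will not contain it and a browser-driven approach is needed instead.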

ahmedshahriar · Accepted Answer · 2022-05-19 01:16:59Z

That website is dynamically generated, so you can use Selenium to download the desired files.

Here is working code using wget, selenium, and webdriver_manager.

It looks up the download link and saves the .xlsx file in a user-defined directory.

import os

import wget
from selenium import webdriver
from selenium.webdriver import ChromeOptions
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from webdriver_manager.chrome import ChromeDriverManager

options = ChromeOptions()
# optional flags to try, e.g. for headless or containerized runs
# options.binary_location = '/opt/headless-chromium'
# options.add_argument("--headless")
# options.add_argument("--disable-gpu")
# options.add_argument("--no-sandbox")
# options.add_argument('--disable-dev-shm-usage')
# options.add_argument('--disable-gpu-sandbox')
# options.add_argument("--single-process")
options.add_argument(
    "user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.54 Safari/537.36")
options.add_argument('--disable-blink-features=AutomationControlled')
options.add_experimental_option("useAutomationExtension", False)
options.add_experimental_option("excludeSwitches", ["enable-automation"])

# webdriver_manager downloads a matching chromedriver and returns its path
s = Service(ChromeDriverManager().install())
# s = Service(GeckoDriverManager().install())
driver = webdriver.Chrome(service=s, options=options)

try:
    driver.get('https://www.data.va.gov/dataset/Air-Force-Veterans-2017-Living-Only/9u8y-zaby')

    # wait (up to 15 s) for the dynamically rendered download link, then read its URL
    link = WebDriverWait(driver, 15).until(
        EC.presence_of_element_located(
            (By.XPATH, '//*[@id="app"]/div/div[2]/section/div/div/div[2]/a')))
    download_link = link.get_attribute('href')

    # download the file into the "data" directory; create it first, because
    # wget.download treats a non-existent "out" path as a file name
    output_directory = 'data'
    os.makedirs(output_directory, exist_ok=True)
    filename = wget.download(download_link, out=output_directory)
finally:
    # quit() closes every window and stops the driver process
    driver.quit()
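
If you would rather not depend on the wget package, the download step can also be done with the standard library. This is only a sketch under a few assumptions: `choose_filename` and `download_file` are hypothetical helper names, the fallback name `download.xlsx` is arbitrary, and the server is assumed to serve the file to a plain HTTP client without cookies or other session state.

```python
import os
import shutil
import urllib.request
from email.message import Message
from urllib.parse import unquote, urlparse


def choose_filename(url, content_disposition=None, default="download.xlsx"):
    """Prefer the server-suggested name, then the last URL path segment, then a default."""
    if content_disposition:
        # Reuse the stdlib header parser to read the filename parameter.
        msg = Message()
        msg["Content-Disposition"] = content_disposition
        suggested = msg.get_filename()
        if suggested:
            return os.path.basename(suggested)
    name = os.path.basename(unquote(urlparse(url).path))
    return name or default


def download_file(url, output_directory="data"):
    """Stream the response body into output_directory and return the saved path."""
    os.makedirs(output_directory, exist_ok=True)
    with urllib.request.urlopen(url) as response:
        filename = choose_filename(url, response.headers.get("Content-Disposition"))
        path = os.path.join(output_directory, filename)
        with open(path, "wb") as fh:
            shutil.copyfileobj(response, fh)
    return path
```

With the Selenium snippet above, this would be called as `download_file(download_link)` in place of `wget.download(...)`. Some servers reject the default urllib user agent; in that case, pass a `urllib.request.Request` with an explicit `User-Agent` header to `urlopen` instead of the bare URL.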
