I'm trying to download files from a VA dataset website with a Python scraper but I'm having trouble figuring out how to parse the JavaScript in the HTML website that appears to contain the files. This is source code for the website (view-source:https://www.data.va.gov/dataset/Air-Force-Veterans-2017-Living-Only/9u8y-zaby). I'm trying to download the ".xlsx" files, which (by just using command+F on my Mac) I think are in JavaScript objects. I've looked around this site and others but haven't been able to figure out how to scrape links from within JavaScript. How should I go about doing this? Any help would be greatly appreciated.

  • .xlsx means that it's an Excel file. Commented May 17, 2022 at 18:01
  • @Pointy I realize that and it seems they're not contained in the usual HTML, but rather JavaScript objects Commented May 17, 2022 at 18:04
  • No, the JavaScript is probably for arranging the download of the .xlsx files. The files themselves are probably (almost certainly) separate URLs. Commented May 17, 2022 at 18:37
  • In fact you can clearly see it if you inspect that <a> element; the URL for the download is "data.va.gov/download/9u8y-zaby/…" Commented May 17, 2022 at 18:39
  • Storing .xlsx source in JavaScript on a page would be fairly insane, as .xlsx files are generally very large. Commented May 17, 2022 at 18:41

1 Answer

That page is generated dynamically with JavaScript, so you can use Selenium to render it and then download the desired file.

Here is working code that uses wget, selenium and webdriver_manager.

It locates the download link and saves the .xlsx file to a user-defined directory.

import os

import wget
from selenium import webdriver
from selenium.webdriver import ChromeOptions
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from webdriver_manager.chrome import ChromeDriverManager

options = ChromeOptions()
# optional flags to try, e.g. for headless or containerized runs
# options.binary_location = '/opt/headless-chromium'
# options.add_argument("--headless")
# options.add_argument("--disable-gpu")
# options.add_argument("--no-sandbox")
# options.add_argument('--disable-dev-shm-usage')
# options.add_argument('--disable-gpu-sandbox')
# options.add_argument("--single-process")

# present the session as a regular desktop Chrome browser
options.add_argument(
    "user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.54 Safari/537.36")
options.add_argument('--disable-blink-features=AutomationControlled')
options.add_experimental_option("useAutomationExtension", False)
options.add_experimental_option("excludeSwitches", ["enable-automation"])

# webdriver_manager downloads a chromedriver that matches the installed Chrome
s = Service(ChromeDriverManager().install())
driver = webdriver.Chrome(service=s, options=options)

try:
    driver.get('https://www.data.va.gov/dataset/Air-Force-Veterans-2017-Living-Only/9u8y-zaby')

    # wait up to 10 seconds for the JavaScript-rendered download link to appear
    link = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located(
            (By.XPATH, '//*[@id="app"]/div/div[2]/section/div/div/div[2]/a')))
    download_link = link.get_attribute('href')
finally:
    # quit() closes every window and ends the driver process
    driver.quit()

# download the file
output_directory = 'data'  # the file is saved into this directory
os.makedirs(output_directory, exist_ok=True)  # wget needs the directory to exist
filename = wget.download(download_link, out=output_directory)
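
If the links really do show up in the raw page source inside JavaScript objects, as the question describes, a browser may not be needed at all. The following is only a minimal sketch of that idea, not something verified against the VA site: it scans a page's source for double-quoted string literals, decodes them as JSON strings, and keeps the ones that look like URLs containing ".xlsx" or "/download/". The function name `find_file_links`, the marker strings, and the sample markup (including the `example.invalid` URL) are hypothetical and chosen for illustration.

```python
import json
import re

# A JSON/JavaScript double-quoted string literal, including escaped characters.
STRING_LITERAL = re.compile(r'"(?:[^"\\]|\\.)*"')


def find_file_links(page_source, markers=(".xlsx", "/download/")):
    """Return unique URL-like string literals that contain any of the markers."""
    links = []
    for match in STRING_LITERAL.finditer(page_source):
        try:
            # Decoding as JSON turns escapes such as \/ back into plain characters.
            value = json.loads(match.group(0))
        except ValueError:
            continue  # not a valid JSON string literal, skip it
        looks_like_url = value.startswith(("http://", "https://", "/"))
        if looks_like_url and any(m in value for m in markers) and value not in links:
            links.append(value)
    return links


# Hypothetical page source with a link embedded in a JavaScript object.
sample_source = r'''
<script>
  window.initialState = {"name": "Air Force Veterans", "attachments": [
    {"filename": "example.xlsx",
     "href": "https:\/\/example.invalid\/download\/abcd-1234\/example.xlsx"}
  ]};
</script>
'''

for url in find_file_links(sample_source):
    print(url)
```

To try this on a real page, fetch the HTML yourself (for example with `requests.get(url).text`) and pass the text to `find_file_links`. If nothing is found, the links are probably added after the page loads, and the Selenium approach above is the safer route.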