I need to extract information from several pages whose content is generated by JavaScript. I know how to do this with Selenium, but it is slow and I need a more efficient approach.
I came across the requests-html library, which looks like a good fit for my purposes, but I don't seem to be able to get it to run the JavaScript.
I read the documentation at https://requests-html.readthedocs.io/en/latest/ and tried the following code:
from requests_html import HTMLSession
from bs4 import BeautifulSoup
session = HTMLSession()
resp = session.get("https://drive.google.com/file/d/1rZ-DhTFPCen6DvJXlNl3Bxuwj4-ULwoa/view")
resp.html.render()
soup = BeautifulSoup(resp.html.html, 'lxml')
email = soup.find_all('img', {'class':'ndfHFb-c4YZDc-MZArnb-BA389-YLEF4c'})
print(email)
This code finds no matching elements, even though the class exists when I open the link in my browser.
I also tried sending custom headers with my requests, but that didn't help. I then tried the same code (with a different HTML tag, of course) on another link (https://web.archive.org/web/*/stackoverflow.com), but the HTML I get back includes a message saying that my browser must support JavaScript. My code for this part:
from requests_html import HTMLSession
from bs4 import BeautifulSoup
session = HTMLSession()
resp = session.get("https://web.archive.org/web/*/stackoverflow.com")
resp.html.render()
soup = BeautifulSoup(resp.html.html, 'lxml')
print(soup)
The response I get:
<div class="no-script-message">
The Wayback Machine requires your browser to support JavaScript, please email <a href="mailto:[email protected]">[email protected]</a><br/>if you have any questions about this.
</div>
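To tell "the JavaScript never ran" apart from "my selector is wrong", a check along these lines might help. It is only a sketch using the standard library: `ClassCollector` and `classes_in` are hypothetical helpers, and the two HTML strings are made-up stand-ins for `resp.html.html`, not real responses.

```python
from html.parser import HTMLParser


class ClassCollector(HTMLParser):
    """Collect every CSS class name that appears on any tag."""

    def __init__(self):
        super().__init__()
        self.classes = set()

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        for name, value in attrs:
            if name == "class" and value:
                self.classes.update(value.split())


def classes_in(html_text):
    """Return the set of class names found in html_text."""
    collector = ClassCollector()
    collector.feed(html_text)
    collector.close()
    return collector.classes


# Made-up stand-ins for resp.html.html (assumptions, not real output).
unrendered = '<div class="no-script-message">JavaScript required</div>'
rendered = '<img class="ndfHFb-c4YZDc-MZArnb-BA389-YLEF4c" src="x.png">'

# Is the no-script placeholder still present?
print("no-script-message" in classes_in(unrendered))
# Did the class I am looking for make it into the HTML?
print("ndfHFb-c4YZDc-MZArnb-BA389-YLEF4c" in classes_in(rendered))
```

If the target class is missing from the set while the placeholder class is present, that would suggest the page was never rendered, rather than a problem with the BeautifulSoup query.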
Any help would be appreciated. Thanks!