Resolving Angular.JS when using Python Requests Module

Question

I am writing a Python script to collect image URLs from a site that uses AngularJS. However, requests.get returns the page without the AngularJS templates resolved. For example:

>>> import requests
>>> url = "https://website.com"
>>> response = requests.get(url)
>>> response.text
<img ng-src="{{ getThumbnail(attachment).href }}" >

I've looked for alternatives to the requests module, but I haven't found anyone discussing this specific issue, so most of my attempts with other modules have been shots in the dark. What alternatives do I have for retrieving the resolved Angular href?

Have you tried pypi.org/project/requests-html, which has full JavaScript support?
– Dan-Dev, Commented Dec 21, 2019 at 14:41
@Dan-Dev I have not yet. That sounds pretty promising; I will try it out.
– AFW, Commented Dec 21, 2019 at 14:49
@Dan-Dev Alright, I think I have tried enough to say no: either the requests_html module is not resolving the AngularJS, or I am doing something wrong. My understanding is that the following should work: r = session.get('website.com'); r.html.render(); r.text. But it does not seem to. I am passing the text to bs4 to locate the image, and the text is still "<img ng-src="{{ getThumbnail(attachment).href }}" >"
– AFW, Commented Dec 21, 2019 at 21:35
Sure. @Dan-Dev I am trying to pull the href for the images from this site: namus.gov/MissingPersons/Case#/51238 The images are in the <div class="attachment-image"><img>
– AFW, Commented Dec 21, 2019 at 22:17

Dan-Dev · Accepted Answer · 2019-12-21 23:33:19Z

The problem with requests-html is your URL: it contains a #, which starts a fragment identifier.

From https://en.wikipedia.org/wiki/Fragment_identifier

When an agent (such as a web browser) requests a web resource from a web server, the agent sends the URI to the server, but does not send the fragment. Instead, the agent waits for the server to send the resource, and then the agent processes the resource according to the document type and fragment value.

requests-html does not appear to use the fragment identifier, so the client-side route (#/51238) is presumably never applied when the page is rendered.
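To see which part of the URL actually reaches the server, you can split it with the standard library. This is only a sketch using urllib.parse.urldefrag; it illustrates the URL structure described in the quote and says nothing about what requests-html does internally. The variable names are just for illustration.

```python
from urllib.parse import urldefrag

url = "https://www.namus.gov/MissingPersons/Case#/51238/"

# urldefrag splits a URL into the part sent to the server and the fragment,
# which stays on the client for the page's JavaScript to interpret.
sent_to_server, fragment = urldefrag(url)

print(sent_to_server)  # https://www.namus.gov/MissingPersons/Case
print(fragment)        # /51238/
```

The server only ever sees the first value, so the case number has to be handled by something that runs the page's JavaScript, such as a real browser.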

The only option I can think of is using Selenium.

sudo pip3 install selenium

Then download a ChromeDriver build from https://sites.google.com/a/chromium.org/chromedriver/downloads. Depending on your OS, you may need to specify the location of the driver.

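from selenium import webdriver
from selenium.webdriver.common.by import By


url = "https://www.namus.gov/MissingPersons/Case#/51238/"
driver = webdriver.Chrome()
try:
    # Poll for up to 10 seconds whenever an element is not found immediately,
    # which gives AngularJS time to render the page.
    driver.implicitly_wait(10)
    driver.get(url)
    element = driver.find_element(By.CLASS_NAME, "section-list")

    # The find_element_by_* helpers were removed in Selenium 4.3; use By locators.
    for child_element in element.find_elements(By.XPATH, ".//a"):
        print(child_element.get_attribute("href"))
finally:
    # Close the browser even if a lookup fails.
    driver.quit()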

Outputs:

https://www.namus.gov/api/CaseSets/NamUs/MissingPersons/Cases/51238/Images/83268/Original
https://www.namus.gov/api/CaseSets/NamUs/MissingPersons/Cases/51238/Images/83270/Original
https://www.namus.gov/api/CaseSets/NamUs/MissingPersons/Cases/51238/Images/83271/Original
https://www.namus.gov/api/CaseSets/NamUs/MissingPersons/Cases/51238/Images/83272/Original
https://www.namus.gov/api/CaseSets/NamUs/MissingPersons/Cases/51238/Images/83273/Original
https://www.namus.gov/api/CaseSets/NamUs/MissingPersons/Cases/51238/Images/83274/Original
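If the collected links need tidying before you download anything, something like the following might help. It is a sketch that assumes the /Images/<id>/Original path layout seen in the output above also holds for other cases; image_id and unique_image_urls are hypothetical helper names, not part of Selenium or the NamUs site. Since get_attribute can return None for an anchor without an href, the helper skips those values.

```python
from urllib.parse import urlsplit


def image_id(href):
    """Return the numeric ID from a .../Images/<id>/Original URL, or None."""
    parts = urlsplit(href).path.strip("/").split("/")
    if "Images" in parts:
        i = parts.index("Images")
        if i + 1 < len(parts) and parts[i + 1].isdigit():
            return parts[i + 1]
    return None


def unique_image_urls(hrefs):
    """Keep image URLs only, dropping None values and duplicates, in order."""
    seen = set()
    result = []
    for href in hrefs:
        if href and image_id(href) is not None and href not in seen:
            seen.add(href)
            result.append(href)
    return result


base = "https://www.namus.gov/api/CaseSets/NamUs/MissingPersons/Cases/51238"
hrefs = [
    base + "/Images/83268/Original",
    None,                                   # anchor without an href
    base + "/Images/83268/Original",        # duplicate link
    "https://www.namus.gov/MissingPersons/Search",  # not an image link
    base + "/Images/83270/Original",
]
cleaned = unique_image_urls(hrefs)
print(cleaned)
```

In the Selenium loop, you would append each get_attribute("href") result to a list and pass that list to unique_image_urls instead of printing it directly.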

edited Dec 21, 2019 at 23:33

answered Dec 21, 2019 at 23:27

Dan-Dev

1 Comment

AFW Over a year ago

Thank you for the thorough answer explaining why requests-html didn't work, and thanks for providing the proof of concept. That does exactly what I need.
