I am writing a Python script to collect image URLs from a site that uses AngularJS. However, requests.get returns the page without the AngularJS expressions resolved. For example:

>>> import requests
>>> url = "https://website.com"
>>> response = requests.get(url)
>>> response.text
<img ng-src="{{ getThumbnail(attachment).href }}" >

I have looked for alternatives to the requests module, but I have not found anyone discussing this specific issue, so most of my attempts with other modules have been shots in the dark. What alternatives do I have for retrieving the href that Angular resolves?
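One way to confirm that a response still contains unrendered template markup is to scan it for ng-src attributes that still hold {{ ... }} expressions. The following is a minimal sketch using only the standard library; the class and function names are hypothetical, and the sample string is the markup shown above.

```python
from html.parser import HTMLParser


class NgSrcFinder(HTMLParser):
    """Collects ng-src values on <img> tags that still contain '{{'."""

    def __init__(self):
        super().__init__()
        self.unresolved = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        for name, value in attrs:
            # An Angular expression that was never evaluated keeps its braces.
            if name == "ng-src" and value and "{{" in value:
                self.unresolved.append(value)


def find_unresolved_ng_src(html):
    """Return the unevaluated ng-src expressions found in an HTML string."""
    finder = NgSrcFinder()
    finder.feed(html)
    return finder.unresolved


sample = '<img ng-src="{{ getThumbnail(attachment).href }}" >'
print(find_unresolved_ng_src(sample))
```

If the returned list is non-empty, the HTML was fetched before any JavaScript ran, so the real image URLs are not in it.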

  • Have you tried requests-html (pypi.org/project/requests-html), which has full JavaScript support? Commented Dec 21, 2019 at 14:41
  • @Dan-Dev I have not yet. That sounds pretty promising, I will try it out. Commented Dec 21, 2019 at 14:49
  • @Dan-Dev Alright, I think I have tried enough to say no: either the requests_html module is not resolving the AngularJS, or I am doing something wrong. My understanding is that the following should work: r = session.get('website.com'); r.html.render(); r.text. But it does not. I am passing the text to bs4 to locate the image, and the text is still <img ng-src="{{ getThumbnail(attachment).href }}" >. Commented Dec 21, 2019 at 21:35
  • Is it possible to post the URL? Commented Dec 21, 2019 at 22:05
  • Sure. @Dan-Dev I am trying to pull the href for the images from this site: namus.gov/MissingPersons/Case#/51238 The images are in the <div class="attachment-image"><img> Commented Dec 21, 2019 at 22:17

1 Answer

The problem with requests-html is that your URL contains a #, which starts a fragment identifier.

From https://en.wikipedia.org/wiki/Fragment_identifier:

When an agent (such as a web browser) requests a web resource from a web server, the agent sends the URI to the server, but does not send the fragment. Instead, the agent waits for the server to send the resource, and then the agent processes the resource according to the document type and fragment value.

requests-html does not appear to use the fragment identifier when it renders the page.
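To see which part of the URL is the fragment, the standard library can split it. This is only a sketch of the URL structure described in the quote above; the helper name split_fragment is hypothetical.

```python
from urllib.parse import urldefrag, urlsplit


def split_fragment(url):
    """Return (url_without_fragment, fragment) for a URL string."""
    result = urldefrag(url)
    return result.url, result.fragment


url = "https://www.namus.gov/MissingPersons/Case#/51238/"
base, fragment = split_fragment(url)

# The part that identifies the resource on the server.
print(base)      # https://www.namus.gov/MissingPersons/Case
# The part that is left for the client (here, the Angular app) to interpret.
print(fragment)  # /51238/

# urlsplit exposes the same pieces individually.
print(urlsplit(url).path)  # /MissingPersons/Case
```

The case number lives entirely in the fragment, so a client that ignores the fragment only ever asks for /MissingPersons/Case.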

The only option I can think of is using Selenium.

sudo pip3 install selenium

Then download a driver from https://sites.google.com/a/chromium.org/chromedriver/downloads (depending on your OS, you may need to specify the location of the driver).

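from selenium import webdriver
from selenium.webdriver.common.by import By

url = "https://www.namus.gov/MissingPersons/Case#/51238/"
driver = webdriver.Chrome()
# Let element lookups poll for up to 10 seconds while Angular renders the page.
driver.implicitly_wait(10)
driver.get(url)
element = driver.find_element(By.CLASS_NAME, "section-list")

# Print the href of every link inside the section list.
for child_element in element.find_elements(By.XPATH, ".//a"):
    print(child_element.get_attribute("href"))

driver.quit()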

Outputs:

https://www.namus.gov/api/CaseSets/NamUs/MissingPersons/Cases/51238/Images/83268/Original
https://www.namus.gov/api/CaseSets/NamUs/MissingPersons/Cases/51238/Images/83270/Original
https://www.namus.gov/api/CaseSets/NamUs/MissingPersons/Cases/51238/Images/83271/Original
https://www.namus.gov/api/CaseSets/NamUs/MissingPersons/Cases/51238/Images/83272/Original
https://www.namus.gov/api/CaseSets/NamUs/MissingPersons/Cases/51238/Images/83273/Original
https://www.namus.gov/api/CaseSets/NamUs/MissingPersons/Cases/51238/Images/83274/Original
1 Comment

Thank you for the thorough answer explaining why requests-html didn't work, and thanks for providing the proof of concept. That does exactly what I need.
