
Note: Any solution is fine; selenium just seems like the most likely tool for this.

Imgur has albums, and the image links of an album are stored in (a React element?) GalleryPost.album_image_store._.posts.{ALBUM_ID}.images (thanks to this guy for figuring that out).

Using the React DevTools extension for Chrome I can see this array of image links, but I want to be able to access it from a Python script.

Any ideas how?

P.S. I don't know much at all about React, so please excuse me if this is a stupid question or if I'm using incorrect terminology.

Here's the album I've been working with: https://imgur.com/a/JNzjB
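
For what it's worth, the kind of thing I was imagining is below. This is a rough, untested sketch: it assumes GalleryPost is actually exposed as a window global on the album page, which I haven't been able to confirm (React DevTools may be reading internal component state instead).

from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://imgur.com/a/JNzjB")

# Hypothetical: only works if imgur exposes GalleryPost as a window global.
links = driver.execute_script(
    'return window.GalleryPost'
    ' && GalleryPost.album_image_store._.posts["JNzjB"].images;')
print(links)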

Implemented Solution:

Huge thanks to Eduard Florinescu for working with me to figure all this out. I hardly knew anything about selenium, how to run JavaScript in selenium, or what commands I could use.

Modifying some of his code, I ended up with the following.

from time import sleep

from bs4 import BeautifulSoup
from selenium import webdriver  
from selenium.webdriver.chrome.options import Options


# Snagged from: https://stackoverflow.com/a/480227
def rmdupe(seq):
    # Removes duplicates from list
    seen = set()
    seen_add = seen.add
    return [x for x in seq if not (x in seen or seen_add(x))]


chrome_options = Options()  
chrome_options.add_argument("--headless")  

prefs = {"profile.managed_default_content_settings.images":2}
chrome_options.add_experimental_option("prefs",prefs)

driver = webdriver.Chrome(chrome_options=chrome_options)
driver.set_window_size(1920, 10000)
driver.get("https://imgur.com/a/JNzjB")


links = []
for i in range(0, 10):  # Tune as needed
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    for div in soup.find_all('div', {'class': 'image post-image'}):
        imgs = div.find_all('img')
        for img in imgs:
            srcs = img.get_attribute_list('src')
            links.extend(srcs)
        sources = div.find_all('source')
        for s in sources:
            srcs = s.get_attribute_list('src')
            links.extend(srcs)
    links = rmdupe(links)  # Remove duplicates
    driver.execute_script('window.scrollBy(0, 750)')
    sleep(.2)

>>> len(links)
36  # Huzzah! Got all the album links!

Notes:

  • Creates a headless chrome instance, so the code can be implemented in a script or potentially a larger project.

  • I used BeautifulSoup because it's a bit easier to work with, and I was having some issues with finding elements and accessing their values using selenium (likely due to inexperience). A rough pure-selenium equivalent is sketched after this list.

  • Set the display size to be really "tall" so more image links are loaded at once.

  • Disabled images in chrome browser settings to stop the browser from actually downloading the images (all I need are the links).

  • Some links are .mp4 files and are rendered in HTML as video elements with <source> tags inside that hold the link. The portion of code starting with sources = div.find_all('source') is there to make sure no album links are lost.
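
For reference, here's the rough pure-selenium version of the extraction step mentioned above. This is an untested sketch and assumes the same 'image post-image' class names as the BeautifulSoup code:

links = []
for i in range(0, 10):  # Tune as needed
    # Grab every <img> and <source> inside the album's image containers.
    elems = driver.find_elements_by_css_selector(
        'div.image.post-image img, div.image.post-image source')
    for el in elems:
        src = el.get_attribute('src')
        if src:
            links.append(src)
    links = rmdupe(links)  # Remove duplicates
    driver.execute_script('window.scrollBy(0, 750)')
    sleep(.2)

A StaleElementReferenceException could pop up if imgur swaps the DOM mid-loop; wrapping the inner loop in a try/except would handle that.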

  • Can you add a link to that page? Commented Feb 15, 2018 at 19:27
  • 1
    Questions asking us to recommend or find a book, tool, software library, tutorial or other off-site resource are off-topic for Stack Overflow as they tend to attract opinionated answers and spam. Instead, describe the problem and what has been done so far to solve it. Commented Feb 15, 2018 at 19:28
  • @MarioNikolaus Any imgur album will work. Here's an example: imgur.com/a/JNzjB. Commented Feb 15, 2018 at 19:37
  • At first glance, you could retrieve the links using an XPath (//div[@class="post-images"]//img) and doing get_attribute('src'), but the thing is the DOM changes as you scroll down... at least it's a start. :P Commented Feb 15, 2018 at 22:37
  • @Mangohero1 Exactly the problem I'm running into. Being able to access the react components would solve the problem, but I can't find any way to do this. Commented Feb 15, 2018 at 22:41

1 Answer


You don't need to know any framework to automate a page. You just need to access the DOM, and you can do that with selenium and Python. Sometimes a bit of plain vanilla JavaScript helps, too.

To get those links, you can try pasting this in the console:

images_links =[]; images = document.querySelectorAll("img"); for (image of images){images_links.push(image.src)} console.log(images_links)

The same thing with selenium and Python, using the JS snippet above:

import selenium
from selenium import webdriver
from time import sleep
driver = webdriver.Chrome()

driver.get("https://imgur.com/a/JNzjB")
for i in range(0,7): # here you will need to tune to see exactly how many scrolls you need
  driver.execute_script('window.scrollBy(0, 2000)')

sleep(2)
list_of_images_links=driver.execute_script('images_links =[]; images = document.querySelectorAll("img"); for (image of images){images_links.push(image.src)} return images_links;')
print(list_of_images_links)
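
If you don't want to hand-tune the number of scrolls, a possible variation (untested sketch) is to keep scrolling until document.body.scrollHeight stops growing, collecting the links on every pass, since imgur drops off-screen images from the DOM:

links = set()
last_height = 0
while True:
    # Collect whatever images are currently in the DOM.
    links.update(driver.execute_script(
        'images_links = []; images = document.querySelectorAll("img");'
        ' for (image of images){images_links.push(image.src)}'
        ' return images_links;'))
    driver.execute_script('window.scrollBy(0, 2000)')
    sleep(1)
    height = driver.execute_script('return document.body.scrollHeight')
    if height == last_height:  # Nothing new loaded, we are at the bottom.
        break
    last_height = height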


Update:

You don't need selenium: just paste this in an Opera console (make sure you enable multiple downloads) and voila:

document.body.style.zoom=0.1; images=document.querySelectorAll("img"); for (i of images) { var a = document.createElement('a'); a.href = i.src; console.log(i); a.download = i.src; document.body.appendChild(a); a.click(); document.body.removeChild(a); }

The same thing, beautified for reading:

document.body.style.zoom=0.1;
images = document.querySelectorAll("img");
for (i of images) {
    var a = document.createElement('a');
    a.href = i.src;
    console.log(i);
    a.download = i.src;
    document.body.appendChild(a);
    a.click();
    document.body.removeChild(a);
}
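
And if you prefer to do the downloading from Python once you have the links (instead of the anchor-click trick above), a minimal Python 3 sketch using the requests library (not part of the original snippet) could be:

import os
import requests

# Assumes list_of_images_links was filled by the selenium snippet above.
os.makedirs('imgur_album', exist_ok=True)
for link in list_of_images_links:
    if not link:
        continue
    name = link.rsplit('/', 1)[-1] or 'unnamed'  # crude filename from the URL
    resp = requests.get(link, timeout=30)
    resp.raise_for_status()
    with open(os.path.join('imgur_album', name), 'wb') as f:
        f.write(resp.content)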

Update 2: Opera webdriver

import os
from time import sleep
from selenium import webdriver
from selenium.webdriver.common import desired_capabilities
from selenium.webdriver.opera import options

_operaDriverLoc = os.path.abspath('c:\\Python27\\Scripts\\operadriver.exe')  # Replace this path with the actual path on your machine.
_operaExeLoc = os.path.abspath('c:\\Program Files\\Opera\\51.0.2830.34\\opera.exe')   # Replace this path with the actual path on your machine.

_remoteExecutor = 'http://127.0.0.1:9515'
_operaCaps = desired_capabilities.DesiredCapabilities.OPERA.copy()

_operaOpts = options.ChromeOptions()
_operaOpts.binary_location = _operaExeLoc

# Use the below argument if you want the Opera browser to be in the maximized state when launching.
# The full list of supported arguments can be found on http://peter.sh/experiments/chromium-command-line-switches/
_operaOpts.add_argument('--start-maximized')

driver = webdriver.Chrome(executable_path = _operaDriverLoc, chrome_options = _operaOpts, desired_capabilities = _operaCaps)


driver.get("https://imgur.com/a/JNzjB")
for i in range(0,7): # here you will need to tune to see exactly how many scrolls you need
  driver.execute_script('window.scrollBy(0, 2000)')

sleep(4)
driver.execute_script("document.body.style.zoom=0.1")
list_of_images_links=driver.execute_script('images_links =[]; images = document.querySelectorAll("img"); for (image of images){images_links.push(image.src)} return images_links;')
print(list_of_images_links)
driver.execute_script('document.body.style.zoom=0.1; images=document.querySelectorAll("img"); for (i of images) { var a = document.createElement("a"); a.href = i.src; console.log(i); a.download = i.src; document.body.appendChild(a); a.click(); document.body.removeChild(a); }')

Comments

Tried the selenium portion myself, but it doesn't return all the links in the album. As your screenshot shows, it only returns 4 links, though there are 36 in that album. At least it's consistent, I got the exact same links back as you did.
The reason it's not getting all the links is because imgur dynamically loads the images based on the scroll position. If you scroll all the way down, you'll only see the last 4 images, hence why only 4 were returned. Is there a way to get all the images that have been loaded instead of the images currently in the html source? This is why I was hoping for a way to query the react props.
I will try with the Opera driver (it's a Chrome thing), if you wait a bit; I haven't used operadriver before.
Same on Windows; it seems that Opera support sucks and the only guy working on it quit: github.com/operasoftware/operachromiumdriver/issues/27. Did you try the code in the console, does it work for you?
I will look once more into this (stackoverflow.com/questions/31055124/…) on how to make Opera work, and if that doesn't pan out, give up.
