I work for a company that helps project creators market products and ideas that have been launched on Kickstarter or Indiegogo. Part of my daily duties is to compile the campaign images so they can be used in design work for marketing purposes.

I quickly thought this was a perfect job for a small web scraper, which I then built using Python's Requests and BeautifulSoup. It worked great for months, until Kickstarter seems to have switched to loading the campaign content dynamically with JavaScript. That is, by the way, how Indiegogo handles its campaign pages, which is why I never got the scraper working for Indiegogo.

Now I have to really get in there and figure out how to scrape sites that render their content with JavaScript (Angular, React, Vue, etc.). This is what I have so far, and it is working, kind of.

import requests
from bs4 import BeautifulSoup
from selenium import webdriver

url = "https://www.kickstarter.com/projects/swimn/swimn-s1-the-amazing-powered-kickboard"
elClass = "rte__content"

driver = webdriver.Chrome()
driver.get(url)
html = driver.execute_script("return document.documentElement.outerHTML")
driver.quit()
soup = BeautifulSoup(html, 'lxml')
ele = soup.find('div', {'class': elClass})
imgs = ele.find_all('img', {'class': 'fit'})

for img in imgs:
    print(img)

I get the img elements as raw text. However, if I copy the src URL and try to visit the image page, I get a 403 Forbidden error. But if I copy the same URL from the dev tools on the Kickstarter page and paste it into a new tab, I get the image with no issues.

What is it that is blocking me from accessing those images?

Any feedback would be greatly appreciated.

  • I just realized that print(img) adds amp; after every & in the URL. After removing all those amp; I can get to the image(s) with no issues. Commented Dec 5, 2019 at 17:15

1 Answer

There is something you missed. Just replace the last line with

print(img['src'])

When you print img, BeautifulSoup serializes the entire tag back to HTML, so every & inside an attribute value is escaped as &amp;. When you read img['src'] directly, you get the decoded attribute value, and hence you can visit the URL without error.

Output

https://ksr-ugc.imgix.net/assets/027/167/793/9462ca5c02772e8f976aa9edbdf4dab1_original.gif?ixlib=rb-2.1.0&w=680&fit=max&v=1573434861&auto=format&gif-q=50&q=92&s=4a3ec9017d091efe146aec692b2af720
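The difference between printing the tag and reading the attribute can be reproduced without a browser. The following is a minimal sketch, assuming a made-up img tag whose URL on example.com is only a placeholder; it uses the built-in html.parser backend so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: example.com and the query string are placeholders,
# not a real Kickstarter asset.
html = '<img class="fit" src="https://example.com/a.gif?w=680&amp;fit=max">'

soup = BeautifulSoup(html, "html.parser")
img = soup.find("img", {"class": "fit"})

tag_text = str(img)   # what print(img) shows: the tag as HTML, & escaped as &amp;
src_url = img["src"]  # the decoded attribute value, usable as a URL

print(tag_text)
print(src_url)
```

Copying the URL out of the first printed line carries the escaped ampersands along with it, which would explain the 403 from the image host; the second line is the URL as the browser sees it.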

I am also adding the entire code to avoid any confusion.

import time

from bs4 import BeautifulSoup
from selenium import webdriver

url = "https://www.kickstarter.com/projects/swimn/swimn-s1-the-amazing-powered-kickboard"
elClass = "rte__content"

driver = webdriver.Chrome()
driver.get(url)
# Pause so the JavaScript-loaded campaign content has time to appear.
time.sleep(5)
html = driver.execute_script("return document.documentElement.outerHTML")
driver.quit()

soup = BeautifulSoup(html, 'lxml')
ele = soup.find('div', {'class': elClass})
imgs = ele.find_all('img', {'class': 'fit'})

for img in imgs:
    # Read the attribute value; printing the whole tag shows & as &amp;.
    print(img['src'])

Kindly notice the last line. That is the change that fixes the URLs; the time.sleep(5) call only pauses so the dynamically loaded content has time to appear.

EDIT (headless)

from selenium.webdriver.chrome.options import Options
chrome_options = Options()
chrome_options.add_argument("--headless")
driver = webdriver.Chrome(options=chrome_options)

Use these lines in place of the original driver = webdriver.Chrome() line.

Hope this solves your issue :)

5 Comments

Yep, that worked. Thank you so much. Do you know of a way to do this without Chrome having to open a new window? Basically, the scraping would happen in the background, without windows opening and closing.
Yeah, make it headless.
Aaah, what can I use besides Chrome as a webdriver that would make it headless?
You can make Chrome headless. No need for any additional dependencies, just an import statement.
@Sergio I have added the code to make the browser headless. If this solves your problem, then kindly accept the answer so that in the future anyone else with the same doubt can find the solution quickly. Cheers :)
