I work for a company that helps project creators market products and ideas launched on Kickstarter or Indiegogo. Part of my daily work is collecting the campaign images so we can use them in designs for marketing purposes.
I quickly thought this was a perfect job for a small web scraper, which I built using Python's Requests and BeautifulSoup. It worked great for months, until Kickstarter apparently switched to loading the campaign content dynamically with JavaScript. That is also how Indiegogo handles its campaign pages, which is why I never got the scraper working for Indiegogo.
Now I really have to get in there and figure out how to scrape sites that render their content with JavaScript (Angular, React, Vue, etc.). This is what I have so far, and it is working, kind of.
import requests
from bs4 import BeautifulSoup
from selenium import webdriver

url = "https://www.kickstarter.com/projects/swimn/swimn-s1-the-amazing-powered-kickboard"
elClass = "rte__content"

# Let a real browser run the page's JavaScript, then grab the rendered HTML.
driver = webdriver.Chrome()
driver.get(url)
html = driver.execute_script("return document.documentElement.outerHTML")
driver.quit()

# Parse the rendered HTML and collect the images inside the campaign content.
soup = BeautifulSoup(html, 'lxml')
ele = soup.find('div', {'class': elClass})
imgs = ele.find_all('img', {'class': 'fit'})
for img in imgs:
    print(img)
I get the img elements as raw text. However, if I copy a src URL from that output and try to open the image, I get a 403 Forbidden error. If I copy the same URL from the dev tools on the Kickstarter page and paste it into a new tab, the image loads with no issues.
What is it that is blocking me from accessing those images?
Any feedback would be greatly appreciated.
print(img) is adding a bunch of amp; after every & in the URL. After removing all those amp; I can get to the image(s) with no issues.
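For anyone hitting the same thing: printing a tag serializes it back to HTML, and in HTML an & inside an attribute is written as &amp;, so the printed URL is not the real one. Below is a minimal sketch of undoing that with the standard library instead of removing the amp; by hand. The function name clean_src and the example.com URL are made up for illustration; they are not from the Kickstarter page. With BeautifulSoup, reading the attribute directly (img["src"]) should give the already-decoded value, so the escaping only shows up when the whole tag is printed.

```python
from html import unescape


def clean_src(printed_src: str) -> str:
    """Decode HTML entities (such as &amp;) in a URL copied from printed markup."""
    return unescape(printed_src)


# Hypothetical URL, shaped like what print(img) shows: every & appears as &amp;
printed = "https://example.com/assets/photo.jpg?w=680&amp;fit=max&amp;v=1"

fixed = clean_src(printed)
print(fixed)
```

The decoded string is what the browser actually requests, so it can be pasted into a new tab or passed to requests.get.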