Python / BeautifulSoup / Selenium web scraping - not able to view content

Question

I work for a company that helps project creators market products and ideas launched on Kickstarter or Indiegogo. Part of my daily work is to collect the campaign images so we can use them in our marketing designs.

This seemed like a perfect job for a small web scraper, which I built using Python's Requests and BeautifulSoup. It worked great for months, until Kickstarter apparently switched to loading the campaign content dynamically with JavaScript. Indiegogo already handled its campaign pages this way, which is why I never got the scraper working for Indiegogo.

Now I have to figure out how to scrape sites that render their content with JavaScript (Angular, React, Vue, etc.). This is what I have so far, and it is working, kind of.

import requests
from bs4 import BeautifulSoup
from selenium import webdriver

url = "https://www.kickstarter.com/projects/swimn/swimn-s1-the-amazing-powered-kickboard"
elClass = "rte__content"

driver = webdriver.Chrome()
driver.get(url)
html = driver.execute_script("return document.documentElement.outerHTML")
driver.quit()
soup = BeautifulSoup(html, 'lxml')
ele = soup.find('div', {'class': elClass})
imgs = ele.find_all('img', {'class': 'fit'})

for img in imgs:
    print(img)

I get the img elements as raw text. However, if I copy the src URL and try to visit the image, I get a 403 Forbidden error. If I copy the same URL from the dev tools on the Kickstarter page and paste it into a new tab, the image loads with no issues.

What is it that is blocking me from accessing those images?

Any feedback would be greatly appreciated.

I just realized that print(img) is adding amp; after every & in the URL. After removing all those amp; I can get to the image(s) with no issues.
– Sergio, commented Dec 5, 2019 at 17:15

Debdut Goswami · Accepted Answer · 2019-12-05 17:40:41Z

2

You missed one detail. Replace the last line with:

print(img['src'])

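When you print img, BeautifulSoup serializes the whole tag back to HTML, and in HTML an & inside an attribute value is written as &amp;. When you read img['src'], you get the decoded attribute value, so the URL contains plain & characters and you can visit it without error.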

Output

https://ksr-ugc.imgix.net/assets/027/167/793/9462ca5c02772e8f976aa9edbdf4dab1_original.gif?ixlib=rb-2.1.0&w=680&fit=max&v=1573434861&auto=format&gif-q=50&q=92&s=4a3ec9017d091efe146aec692b2af720
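The difference between printing the tag and reading its src attribute can be reproduced without a browser. The following is a minimal sketch, assuming a made-up img tag and an example.com URL (neither comes from the Kickstarter page). It also shows html.unescape from the standard library, which should do the same decoding if all you have is the escaped text.

```python
import html

from bs4 import BeautifulSoup

# Hypothetical markup for illustration; the URL is not a real Kickstarter asset.
markup = '<img class="fit" src="https://example.com/a.gif?w=680&amp;fit=max">'
img = BeautifulSoup(markup, "html.parser").find("img")

serialized = str(img)  # what print(img) shows: the tag as HTML, with & written as &amp;
src = img["src"]       # the attribute value, with entities already decoded

print(serialized)
print(src)

# If only the escaped text is available, html.unescape performs the same decoding.
unescaped = html.unescape("https://example.com/a.gif?w=680&amp;fit=max")
print(unescaped)
```

This is also why manually deleting every amp; from the printed tag, as mentioned in the question's comment, produces a working URL.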

I am also adding the entire code to avoid any confusion.

from bs4 import BeautifulSoup
from selenium import webdriver
import time

url = "https://www.kickstarter.com/projects/swimn/swimn-s1-the-amazing-powered-kickboard"
elClass = "rte__content"

driver = webdriver.Chrome()
driver.get(url)
time.sleep(5)  # give the page's JavaScript time to render the campaign content
html = driver.execute_script("return document.documentElement.outerHTML")
driver.quit()

# Parse the rendered HTML and collect the campaign images
soup = BeautifulSoup(html, 'lxml')
ele = soup.find('div', {'class': elClass})
imgs = ele.find_all('img', {'class': 'fit'})

for img in imgs:
    print(img['src'])  # the attribute value, not the serialized tag

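Note the last line: that is the only change needed to get usable URLs. The added time.sleep(5) gives the dynamically loaded content time to render before the HTML is read.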

EDIT (headless)

from selenium.webdriver.chrome.options import Options
chrome_options = Options()
chrome_options.add_argument("--headless")
driver = webdriver.Chrome(options=chrome_options)

Use these lines in place of the existing driver = webdriver.Chrome() line.

Hope this solves your issue :)

edited Dec 5, 2019 at 17:40

answered Dec 5, 2019 at 17:28


5 Comments

Sergio Over a year ago

Yep, that worked. Thank you so much. Do you know of a way to do this without Chrome opening a new window? Basically, I'd like the scraping to happen in the background, without windows opening and closing.

Debdut Goswami Over a year ago

Yeah, make it headless.

Sergio Over a year ago

Aaah what can I use besides Chrome as a webdriver that would make it headless?

Debdut Goswami Over a year ago

You can make Chrome headless. No additional dependencies are needed, just an import statement.

Debdut Goswami Over a year ago

@Sergio I have added the code to make the browser headless. If this solves your problem, kindly accept the answer so that anyone with the same question in the future can find the solution quickly. Cheers :)
