I am trying to use Selenium with threads. My code is:

import threading  as th
import time
import base64
import mysql.connector as mysql
import requests
from bs4 import BeautifulSoup
from seleniumwire import webdriver
from selenium.webdriver.chrome.options import Options
from functions import *

options = Options()
prefs = {'profile.default_content_setting_values': {'images': 2,'popups': 2, 'geolocation': 2, 
                            'notifications': 2, 'auto_select_certificate': 2, 'fullscreen': 2, 
                            'mouselock': 2, 'mixed_script': 2, 'media_stream': 2, 
                            'media_stream_mic': 2, 'media_stream_camera': 2, 'protocol_handlers': 2, 
                            'ppapi_broker': 2, 'automatic_downloads': 2, 'midi_sysex': 2, 
                            'push_messaging': 2, 'ssl_cert_decisions': 2, 'metro_switch_to_desktop': 2, 
                            'protected_media_identifier': 2, 'app_banner': 2, 'site_engagement': 2, 
                            'durable_storage': 2}}
print('Crawling process started')
options.add_experimental_option('prefs', prefs)
driver = webdriver.Chrome(executable_path='chromedriver.exe', options=options)
driver.set_page_load_timeout(50000)
urls='https://google.com https://youtube.com'
def getinf(url_):
    driver.get(url_)
    soup=BeautifulSoup(driver.page_source, 'html5lib')
    print(soup.select('title'))
for url in urls.split():
    t=th.Thread(target=getinf, args=(url,))
    t.start()

When the script runs, the tabs do not open at once as I expected from threads. Instead, the pages are loaded one by one, and only the title of the last URL (https://youtube.com) is printed. When I try multiprocessing, the program crashes repeatedly. I am making a web crawler, and some websites (like Twitter) require JavaScript to show their content, so I can't use requests or urllib either. What would be a solution for this? Suggestions for other libraries are welcome.

  • YouTube and Twitter have Python APIs. Commented Jan 19, 2020 at 13:22
  • Selenium drivers are not thread safe. Commented Jan 19, 2020 at 13:24
  • I don't want to develop separate software for YouTube, Twitter, etc. to extract data. I want it all in one. How can I do that? Commented Jan 19, 2020 at 14:28
  • If it has to be Python, there's pyppeteer; otherwise puppeteer is a better choice. Commented Jan 20, 2020 at 2:34
  • Hi! Have you solved your problem? I also use seleniumwire and ran into the same problem. I found that sharing a driver across threads is not thread safe, and it seems seleniumwire can't be run through multiprocessing either because of its internal proxy. If you have solved it, how did you do it? Commented Aug 28, 2021 at 7:05
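
The "only the last title is printed" symptom can be reproduced without a browser. The sketch below is a simplified model, not Selenium itself: `FakeDriver` and `shared_driver_demo` are hypothetical names, and the `Barrier` deliberately forces every thread to navigate before any thread reads, which is the worst-case interleaving for a shared driver.

```python
import threading


class FakeDriver:
    """Hypothetical stand-in for a browser driver: it only remembers one URL."""

    def __init__(self):
        self.current_url = None

    def get(self, url):
        self.current_url = url


def shared_driver_demo(urls):
    driver = FakeDriver()                    # one driver shared by every thread
    barrier = threading.Barrier(len(urls))   # forces the worst-case interleaving
    seen = {}

    def getinf(url):
        driver.get(url)
        barrier.wait()                       # all threads navigate before any reads
        seen[url] = driver.current_url       # every thread reads the same, last URL

    threads = [threading.Thread(target=getinf, args=(u,)) for u in urls]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return seen


if __name__ == "__main__":
    print(shared_driver_demo(["https://google.com", "https://youtube.com"]))
```

Every thread ends up reporting the same URL, whichever one was requested last, because there is only one piece of browser state to read from.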

1 Answer

Try creating the chromedriver inside the thread function. Otherwise you have a single driver, and every thread changes the URL of that same driver. Instead, create a separate chromedriver for each thread.

Note: I have not tested the code; this is just a suggestion.

import threading  as th
import time
import base64
import mysql.connector as mysql
import requests
from bs4 import BeautifulSoup
from seleniumwire import webdriver
from selenium.webdriver.chrome.options import Options
from functions import *

options = Options()
prefs = {'profile.default_content_setting_values': {'images': 2,'popups': 2, 'geolocation': 2, 
                            'notifications': 2, 'auto_select_certificate': 2, 'fullscreen': 2, 
                            'mouselock': 2, 'mixed_script': 2, 'media_stream': 2, 
                            'media_stream_mic': 2, 'media_stream_camera': 2, 'protocol_handlers': 2, 
                            'ppapi_broker': 2, 'automatic_downloads': 2, 'midi_sysex': 2, 
                            'push_messaging': 2, 'ssl_cert_decisions': 2, 'metro_switch_to_desktop': 2, 
                            'protected_media_identifier': 2, 'app_banner': 2, 'site_engagement': 2, 
                            'durable_storage': 2}}
print('Crawling process started')
options.add_experimental_option('prefs', prefs)
urls='https://google.com https://youtube.com'
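def getinf(url_):
    # Each thread creates its own driver, so no driver is shared between threads.
    driver = webdriver.Chrome(executable_path='chromedriver.exe', options=options)
    try:
        driver.set_page_load_timeout(50000)
        driver.get(url_)
        soup=BeautifulSoup(driver.page_source, 'html5lib')
        print(soup.select('title'))
    finally:
        # Close the browser even if the page load fails.
        driver.quit()
threads = []
for url in urls.split():
    t=th.Thread(target=getinf, args=(url,))
    t.start()
    threads.append(t)
# Wait for every thread to finish before the script exits.
for t in threads:
    t.join()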
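
For a crawler with many URLs, starting one browser per URL gets expensive. One possible variation, sketched here under assumptions, is a fixed pool of worker threads where each worker creates its own driver once and reuses it. The names `crawl_titles`, `make_driver`, and `FakeDriver` are hypothetical; `FakeDriver` only imitates the `get`, `title`, and `quit` members of a real driver so the sketch runs without a browser.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class FakeDriver:
    """Hypothetical stand-in for a Selenium driver (get, title, quit only)."""

    def __init__(self):
        self.title = ""

    def get(self, url):
        self.title = "title of " + url

    def quit(self):
        pass


def crawl_titles(urls, make_driver, max_workers=2):
    """Return {url: title}, giving every worker thread its own driver."""
    local = threading.local()   # per-thread storage for the driver
    drivers = []                # every driver created, so all can be closed
    lock = threading.Lock()     # protects the shared list above

    def get_driver():
        # Create the driver lazily, once per worker thread.
        if not hasattr(local, "driver"):
            local.driver = make_driver()
            with lock:
                drivers.append(local.driver)
        return local.driver

    def fetch(url):
        driver = get_driver()
        driver.get(url)
        return url, driver.title

    try:
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            return dict(pool.map(fetch, urls))
    finally:
        # Close every browser, even if a page load raised.
        for d in drivers:
            d.quit()


if __name__ == "__main__":
    print(crawl_titles(["https://google.com", "https://youtube.com"], FakeDriver))
```

With real Selenium, the factory would be something like `lambda: webdriver.Chrome(executable_path='chromedriver.exe', options=options)`, matching the call used above. This has not been tested against seleniumwire, whose internal proxy may behave differently.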