How to use Selenium with Python in a multithreaded way

Question

I am trying to use Selenium with threads. My code is:

import threading  as th
import time
import base64
import mysql.connector as mysql
import requests
from bs4 import BeautifulSoup
from seleniumwire import webdriver
from selenium.webdriver.chrome.options import Options
from functions import *

options = Options()
prefs = {'profile.default_content_setting_values': {'images': 2,'popups': 2, 'geolocation': 2, 
                            'notifications': 2, 'auto_select_certificate': 2, 'fullscreen': 2, 
                            'mouselock': 2, 'mixed_script': 2, 'media_stream': 2, 
                            'media_stream_mic': 2, 'media_stream_camera': 2, 'protocol_handlers': 2, 
                            'ppapi_broker': 2, 'automatic_downloads': 2, 'midi_sysex': 2, 
                            'push_messaging': 2, 'ssl_cert_decisions': 2, 'metro_switch_to_desktop': 2, 
                            'protected_media_identifier': 2, 'app_banner': 2, 'site_engagement': 2, 
                            'durable_storage': 2}}
print('Crawling process started')
options.add_experimental_option('prefs', prefs)
driver = webdriver.Chrome(executable_path='chromedriver.exe', options=options)
driver.set_page_load_timeout(50000)
urls='https://google.com https://youtube.com'
def getinf(url_):
    driver.get(url_)
    soup=BeautifulSoup(driver.page_source, 'html5lib')
    print(soup.select('title'))
for url in urls.split():
    t=th.Thread(target=getinf, args=(url,))
    t.start()

When the script runs, the tabs do not open at once as I expected from the threads. Instead the pages are loaded one by one, and only the title of the last URL (https://youtube.com) is printed. When I try multiprocessing, the program often crashes. I am building a web crawler, and some websites (such as Twitter) require JavaScript to show their content, so I can't use requests or urllib either. What is the solution? Suggestions for other libraries are welcome.

I don't want to develop separate software for YouTube, Twitter, etc. to extract data. I want it all in one. How can I do that?
– user12320641, Commented Jan 19, 2020 at 14:28
If it has to be Python, there's pyppeteer; otherwise puppeteer is a better choice.
– pguardiario, Commented Jan 20, 2020 at 2:34
Hi! Have you solved your problem? I also use seleniumwire and ran into the same problem. I found that sharing a driver across threads is not thread-safe, and it seems seleniumwire can't be run across multiple processes either, because of its internal proxy. If you have solved it, how did you do it?
– hans, Commented Aug 28, 2021 at 7:05

Jeni · Accepted Answer · 2020-01-19 19:23:09Z

Try creating the chromedriver inside the thread function. Otherwise there is only one driver, and every thread changes the URL of that same driver. Create a separate chromedriver for each thread instead.

Note: I have not tried this code; it is only a suggestion.

import threading  as th
import time
import base64
import mysql.connector as mysql
import requests
from bs4 import BeautifulSoup
from seleniumwire import webdriver
from selenium.webdriver.chrome.options import Options
from functions import *

options = Options()
prefs = {'profile.default_content_setting_values': {'images': 2,'popups': 2, 'geolocation': 2, 
                            'notifications': 2, 'auto_select_certificate': 2, 'fullscreen': 2, 
                            'mouselock': 2, 'mixed_script': 2, 'media_stream': 2, 
                            'media_stream_mic': 2, 'media_stream_camera': 2, 'protocol_handlers': 2, 
                            'ppapi_broker': 2, 'automatic_downloads': 2, 'midi_sysex': 2, 
                            'push_messaging': 2, 'ssl_cert_decisions': 2, 'metro_switch_to_desktop': 2, 
                            'protected_media_identifier': 2, 'app_banner': 2, 'site_engagement': 2, 
                            'durable_storage': 2}}
print('Crawling process started')
options.add_experimental_option('prefs', prefs)
urls='https://google.com https://youtube.com'
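def getinf(url_):
    # Create the driver inside the thread so each thread gets its own browser.
    driver = webdriver.Chrome(executable_path='chromedriver.exe', options=options)
    try:
        driver.set_page_load_timeout(50000)
        driver.get(url_)
        soup = BeautifulSoup(driver.page_source, 'html5lib')
        print(soup.select('title'))
    finally:
        driver.quit()  # close this thread's browser even if the page load fails

threads = []
for url in urls.split():
    t = th.Thread(target=getinf, args=(url,))
    t.start()
    threads.append(t)
for t in threads:
    t.join()  # wait for every thread to finish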
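
The same one-driver-per-thread idea can also be written with a thread pool, which caps how many browsers are open at once. This is only a sketch and I have not run it against real sites. It uses plain selenium rather than seleniumwire, and the names make_chrome_driver, fetch_title and crawl are made up for illustration. It also assumes chromedriver can be found without an explicit path (on PATH, or located by a recent Selenium release).

```python
from concurrent.futures import ThreadPoolExecutor


def make_chrome_driver():
    """Hypothetical factory: builds a new Chrome driver on every call."""
    # Imported lazily so the rest of the module loads without Selenium installed.
    # Assumption: chromedriver is discoverable without an explicit path.
    from selenium import webdriver
    return webdriver.Chrome()


def fetch_title(url, make_driver=make_chrome_driver):
    """Open url in a driver owned by this call only and return the page title."""
    driver = make_driver()
    try:
        driver.get(url)
        return driver.title
    finally:
        driver.quit()  # always close the browser, even if get() raises


def crawl(urls, make_driver=make_chrome_driver, max_workers=4):
    """Fetch the titles concurrently; each task creates and closes its own driver."""
    urls = list(urls)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        titles = list(pool.map(lambda u: fetch_title(u, make_driver), urls))
    return dict(zip(urls, titles))


# Usage (this launches real browsers):
# print(crawl(["https://google.com", "https://youtube.com"]))
```

Passing the driver factory as a parameter is a design choice for the sketch: it lets you swap in a seleniumwire driver, or a fake one for testing, without touching the threading code. max_workers limits how many Chrome instances run at the same time, which matters because each one is a full browser process.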
