

I'm trying to make my current crawler multithreaded.
When I run it with multiple threads, several instances of the same function are started.

Example:

If my function prints range(5), I get 1,1,2,2,3,3,4,4,5,5 with 2 threads.

How can I get the result 1,2,3,4,5 with multiple threads?
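
To illustrate, here is a toy sketch (not my real crawler, just the idea): both threads run the same loop, so every number is printed twice.

import threading

def print_numbers():
    # every thread runs the exact same loop
    for i in range(1, 6):
        print(i)

threads = [threading.Thread(target=print_numbers) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# prints 1,2,3,4,5 twice in some interleaved order;
# I want each number printed exactly once in total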

My current code is the crawler you can see below:

import requests
from bs4 import BeautifulSoup

def trade_spider(max_pages):
    page = 1
    while page <= max_pages:
        # fetch and parse one listing page
        url = "http://stackoverflow.com/questions?page=" + str(page)
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, "html.parser")
        # follow every question link found on the page
        for link in soup.findAll('a', {'class': 'question-hyperlink'}):
            href = link.get('href')
            title = link.string
            print(title)
            get_single_item_data("http://stackoverflow.com/" + href)
        page += 1

def get_single_item_data(item_url):
    # fetch one question page and print its vote count
    source_code = requests.get(item_url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, "html.parser")
    res = soup.find('span', {'class': 'vote-count-post '})
    print("UpVote : " + res.string)

trade_spider(1)

How can I call trade_spider() from multiple threads without crawling duplicate links?

  • Have you tried using a shared multiprocessing.Value?
  • Not yet, I will try it.
  • @DavidCullen Can you give me an example, please? I don't understand how shared multiprocessing works from the docs. Thank you.

2 Answers


Have the page number be an argument to the trade_spider function.

Call the function in each process with a different page number so that each process gets a unique page.

For example:

import multiprocessing

import requests
from bs4 import BeautifulSoup

def trade_spider(page):
    url = "http://stackoverflow.com/questions?page=%s" % (page,)
    source_code = requests.get(url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, "html.parser")
    for link in soup.findAll('a', {'class': 'question-hyperlink'}):
        href = link.get('href')
        title = link.string
        print(title)
        # get_single_item_data is the function from the question
        get_single_item_data("http://stackoverflow.com/" + href)

if __name__ == "__main__":
    # Pool of 10 worker processes
    max_pages = 100
    num_pages = range(1, max_pages + 1)
    pool = multiprocessing.Pool(10)
    # Run and wait for completion.
    # pool.map returns the results of the trade_spider calls,
    # but trade_spider returns nothing, so the result is ignored.
    pool.map(trade_spider, num_pages)
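
If actual threads are preferred over processes (the crawl is I/O-bound, so threads work fine here), the same map pattern can be used with a thread pool. A minimal sketch, assuming the trade_spider and max_pages defined above:

from multiprocessing.pool import ThreadPool

# 10 threads instead of 10 processes; each call still gets its own page number
thread_pool = ThreadPool(10)
thread_pool.map(trade_spider, range(1, max_pages + 1))
thread_pool.close()
thread_pool.join()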

2 Comments

  • Can I have an example, please?
  • Updated with an example.

Try this:

from multiprocessing import Process, Value
import time

max_pages = 100
# 'i' means a shared signed integer; the counter starts at page 1
shared_page = Value('i', 1)
arg_list = (max_pages, shared_page)

# start two spider processes that share the same page counter
process_list = list()
for x in range(2):
    spider_process = Process(target=trade_spider, args=arg_list)
    spider_process.daemon = True
    spider_process.start()
    process_list.append(spider_process)

# wait until both processes have finished
for spider_process in process_list:
    while spider_process.is_alive():
        time.sleep(1.0)
    spider_process.join()

Change the parameter list of trade_spider to

def trade_spider(max_pages, page)

and remove the

    page = 1

This will create two processes that work through the pages by sharing the page value: each process reads and increments page.value, so no page is crawled twice.
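
The answer doesn't show the new body of trade_spider, so here is one possible sketch of how it could consume the shared counter (untested, just an assumption; get_single_item_data is the function from the question):

import requests
from bs4 import BeautifulSoup

def trade_spider(max_pages, page):
    while True:
        # atomically claim the next page number from the shared counter
        with page.get_lock():
            current = page.value
            if current > max_pages:
                break
            page.value += 1
        url = "http://stackoverflow.com/questions?page=" + str(current)
        soup = BeautifulSoup(requests.get(url).text, "html.parser")
        for link in soup.findAll('a', {'class': 'question-hyperlink'}):
            print(link.string)
            # get_single_item_data comes from the question's code
            get_single_item_data("http://stackoverflow.com/" + link.get('href'))

Define this above the process-spawning code so the Process targets can find it.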
