

I'm trying to make my current crawler multithreaded.
When I run it with multiple threads, several instances of the same function are started.

Example:

If my function prints range(5), I get 1,1,2,2,3,3,4,4,5,5 with 2 threads.

How can I get the result 1,2,3,4,5 with multiple threads?
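
To illustrate, here is a toy sketch (not my real crawler, just the idea): both threads run the same loop, so every number is printed twice.

import threading

def print_numbers():
    # every thread runs the exact same loop
    for i in range(1, 6):
        print(i)

threads = [threading.Thread(target=print_numbers) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# prints 1,2,3,4,5 twice in some interleaved order;
# I want each number printed exactly once in total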

My current code is the crawler you can see below:

import requests
from bs4 import BeautifulSoup

def trade_spider(max_pages):
    page = 1
    while page <= max_pages:
        # fetch and parse one listing page
        url = "http://stackoverflow.com/questions?page=" + str(page)
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, "html.parser")
        # follow every question link found on the page
        for link in soup.findAll('a', {'class': 'question-hyperlink'}):
            href = link.get('href')
            title = link.string
            print(title)
            get_single_item_data("http://stackoverflow.com/" + href)
        page += 1

def get_single_item_data(item_url):
    # fetch one question page and print its vote count
    source_code = requests.get(item_url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, "html.parser")
    res = soup.find('span', {'class': 'vote-count-post '})
    print("UpVote : " + res.string)

trade_spider(1)

How can I call trade_spider() from multiple threads without crawling duplicate links?

  • Have you tried using a shared multiprocessing.Value?
  • Not yet, I will try it.
  • @DavidCullen Can you give me an example, please? I don't understand how shared multiprocessing works from the docs. Thank you.

2 Answers


Have the page number be an argument to the trade_spider function.

Call the function in each process with a different page number so that each process gets a unique page.

For example:

import multiprocessing

import requests
from bs4 import BeautifulSoup

def trade_spider(page):
    url = "http://stackoverflow.com/questions?page=%s" % (page,)
    source_code = requests.get(url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, "html.parser")
    for link in soup.findAll('a', {'class': 'question-hyperlink'}):
        href = link.get('href')
        title = link.string
        print(title)
        # get_single_item_data is the function from the question
        get_single_item_data("http://stackoverflow.com/" + href)

if __name__ == "__main__":
    # Pool of 10 worker processes
    max_pages = 100
    num_pages = range(1, max_pages + 1)
    pool = multiprocessing.Pool(10)
    # Run and wait for completion.
    # pool.map returns the results of the trade_spider calls,
    # but trade_spider returns nothing, so the result is ignored.
    pool.map(trade_spider, num_pages)
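
If actual threads are preferred over processes (the crawl is I/O-bound, so threads work fine here), the same map pattern can be used with a thread pool. A minimal sketch, assuming the trade_spider and max_pages defined above:

from multiprocessing.pool import ThreadPool

# 10 threads instead of 10 processes; each call still gets its own page number
thread_pool = ThreadPool(10)
thread_pool.map(trade_spider, range(1, max_pages + 1))
thread_pool.close()
thread_pool.join()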

2 Comments

  • Can I have an example, please?
  • Updated with an example.

Try this:

from multiprocessing import Process, Value
import time

max_pages = 100
# 'i' means a shared signed integer; the counter starts at page 1
shared_page = Value('i', 1)
arg_list = (max_pages, shared_page)

# start two spider processes that share the same page counter
process_list = list()
for x in range(2):
    spider_process = Process(target=trade_spider, args=arg_list)
    spider_process.daemon = True
    spider_process.start()
    process_list.append(spider_process)

# wait until both processes have finished
for spider_process in process_list:
    while spider_process.is_alive():
        time.sleep(1.0)
    spider_process.join()

Change the parameter list of trade_spider to

def trade_spider(max_pages, page)

and remove the

    page = 1

This will create two processes that work through the pages by sharing the page value: each process reads and increments page.value, so no page is crawled twice.
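
The answer doesn't show the new body of trade_spider, so here is one possible sketch of how it could consume the shared counter (untested, just an assumption; get_single_item_data is the function from the question):

import requests
from bs4 import BeautifulSoup

def trade_spider(max_pages, page):
    while True:
        # atomically claim the next page number from the shared counter
        with page.get_lock():
            current = page.value
            if current > max_pages:
                break
            page.value += 1
        url = "http://stackoverflow.com/questions?page=" + str(current)
        soup = BeautifulSoup(requests.get(url).text, "html.parser")
        for link in soup.findAll('a', {'class': 'question-hyperlink'}):
            print(link.string)
            # get_single_item_data comes from the question's code
            get_single_item_data("http://stackoverflow.com/" + link.get('href'))

Define this above the process-spawning code so the Process targets can find it.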
