
In this part of my scraping code, I fetch a lot of URLs stored in a file (url.xml), and it takes very long to finish. How can I implement a multiprocessing Pool here?

Is there any simple code to fix this problem? Thanks

from bs4 import BeautifulSoup as soup
import requests
from multiprocessing import Pool

p = Pool(10) # “10” means that 10 URLs will be processed at the same time
p.map

page_url = "url.xml"


out_filename = "prices.csv"
headers = "availableOffers,otherpricess,currentprice \n"

with open(out_filename, "w") as fw:
  fw.write(headers)
  with open("url.xml", "r") as fr:
    for url in map(lambda x: x.strip(), fr.readlines()): 
      print(url)
      response = requests.get(url)
      page_soup = soup(response.text, "html.parser")


      availableOffers = page_soup.find("input", {"id": "availableOffers"})
      otherpricess = page_soup.find("span", {"class": "price"})
      currentprice = page_soup.find("div", {"class": "is"})

      fw.write(availableOffers + ", " + otherpricess + ", " + currentprice + "\n")


p.terminate()
p.join()
  • Try this blog.adnansiddiqi.me/… Commented Apr 30, 2020 at 18:26
  • I added p.terminate() and p.join() but it still doesn't work Commented Apr 30, 2020 at 18:32
  • Your map call is pasted wrong here, and you need an `if __name__ == '__main__':` guard Commented Apr 30, 2020 at 18:35
  • Rest of the code added to the question; not sure what to add, help please Commented Apr 30, 2020 at 18:39
  • p.map is not right: it requires a function and an iterable as arguments. You need to put your requests fetch logic in a function that takes a URL as a parameter. You also need to protect the call to Pool(10) with `if __name__ == '__main__':` so that the child processes won't create their own pools. Commented Apr 30, 2020 at 18:42

2 Answers


You can use the concurrent.futures standard package in Python for both multiprocessing and multi-threading.

In your case, you don't need multiprocessing; multi-threading will help, because your function is I/O-bound (it spends its time waiting on network responses) rather than computationally expensive.

By using multi-threading, you can send multiple requests at the same time. The number_of_threads argument controls how many requests are in flight at once.

I have created a function, extract_data_from_url_func, that extracts the data from a single URL, and I pass this function and the list of URLs to a thread pool executor from concurrent.futures.

from bs4 import BeautifulSoup as soup
from concurrent.futures import ThreadPoolExecutor
import requests

page_url = "url.xml"
number_of_threads = 6
out_filename = "prices.csv"
headers = "availableOffers,otherpricess,currentprice \n"

def extract_data_from_url_func(url):
    print(url)
    response = requests.get(url)
    page_soup = soup(response.text, "html.parser")
    availableOffers = page_soup.find("input", {"id": "availableOffers"})["value"]
    otherpricess = page_soup.find("span", {"class": "price"}).text.replace("$", "")
    currentprice = page_soup.find("div", {"class": "is"}).text.strip().replace("$", "")
    output_list = [availableOffers, otherpricess, currentprice]
    output = ",".join(output_list)
    print(output)
    return output

with open("url.xml", "r") as fr:
    URLS = list(map(lambda x: x.strip(), fr.readlines()))

with ThreadPoolExecutor(max_workers=number_of_threads) as executor:
    results = executor.map(extract_data_from_url_func, URLS)
    responses = list(results)


with open(out_filename, "w") as fw:
    fw.write(headers)
    for response in responses:
        fw.write(response + "\n")

reference: https://docs.python.org/3/library/concurrent.futures.html
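One caveat with executor.map: if any worker raises (a failed request, or a selector that finds nothing), the exception is re-raised when you iterate the results, which can silently abort the whole run. A sketch of a per-URL guard, using a hypothetical `extract` stand-in rather than the real scraping function:

```python
from concurrent.futures import ThreadPoolExecutor

def extract(url):
    # Stand-in for the real scraping logic; raises for one URL
    # to simulate a page where the selector finds nothing.
    if "bad" in url:
        raise ValueError("selector not found")
    return f"data-from-{url}"

def safe_extract(url):
    # Catch per-URL failures so one bad page does not kill the whole run.
    try:
        return extract(url)
    except Exception as exc:
        print(f"skipping {url}: {exc}")
        return None

urls = ["a", "bad-b", "c"]
with ThreadPoolExecutor(max_workers=2) as executor:
    results = [r for r in executor.map(safe_extract, urls) if r is not None]
print(results)  # the failing URL is skipped, the other two succeed
```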


7 Comments

The code runs without errors, but the file (prices.csv) has no values inside it, only the headers availableOffers,otherpricess,currentprice
Could you please print results first, so that we can be sure the function is working fine?
Yes, it prints the URLs on the screen and runs very fast, but (prices.csv) has no values
Sorry about that, I have made some changes to the code; could you please check the new code?
Is print(responses) showing something in the output?

It must be something of this form. Please change your code so that what is passed to p.map is a list of URLs:

from bs4 import BeautifulSoup as soup
import requests
from multiprocessing import Pool
import csv

def parse(url):
    response = requests.get(url)
    page_soup = soup(response.text, "html.parser")
    availableOffers = page_soup.find("input", {"id": "availableOffers"})["value"]
    otherpricess = page_soup.find("span", {"class": "price"}).text.replace("$", "")
    currentprice = page_soup.find("div", {"class": "is"}).text.strip().replace("$", "")
    return availableOffers, otherpricess, currentprice


if __name__ == '__main__':
    urls = [ ... ]  # List of urls to fetch from
    p = Pool(10) # “10” means that 10 URLs will be processed at the same time
    records = p.map(parse, urls)
    p.terminate()
    p.join()
    with open("outfile.csv", "w", newline="") as csvfile:
        writer = csv.writer(csvfile, delimiter=',', quoting=csv.QUOTE_MINIMAL)
        for r in records:
             writer.writerow(r)
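The `urls = [ ... ]` placeholder above needs to be filled from the question's url.xml file. Assuming that file holds one URL per line (as the question's original loop suggests), a small helper could build the list:

```python
def load_urls(path):
    # Read one URL per line, skipping blank lines and
    # stripping surrounding whitespace/newlines.
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]
```

Then `urls = load_urls("url.xml")` inside the `__main__` block would replace the placeholder list.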

3 Comments

line 15: `if __name__ == '__main__'` ^ SyntaxError: invalid syntax
Missing `:`, added now. Are you new to Python?
Then please take this one step at a time instead of leaping directly into process pools. Break the problem into multiple steps and tackle the simplest first.
