
In this part of my scraping code, I fetch a lot of URLs stored in a file (url.xml), and it takes very long to finish. How can I implement a multiprocessing Pool here?

Is there any simple code to fix this problem? Thanks

from bs4 import BeautifulSoup as soup
import requests
from multiprocessing import Pool

p = Pool(10) # “10” means that 10 URLs will be processed at the same time
p.map

page_url = "url.xml"


out_filename = "prices.csv"
headers = "availableOffers,otherpricess,currentprice \n"

with open(out_filename, "w") as fw:
  fw.write(headers)
  with open("url.xml", "r") as fr:
    for url in map(lambda x: x.strip(), fr.readlines()): 
      print(url)
      response = requests.get(url)
      page_soup = soup(response.text, "html.parser")


      availableOffers = page_soup.find("input", {"id": "availableOffers"})
      otherpricess = page_soup.find("span", {"class": "price"})
      currentprice = page_soup.find("div", {"class": "is"})

      fw.write(availableOffers + ", " + otherpricess + ", " + currentprice + "\n")


p.terminate()
p.join()
  • Try this blog.adnansiddiqi.me/… Commented Apr 30, 2020 at 18:26
  • I added p.terminate() and p.join() but it still doesn't work Commented Apr 30, 2020 at 18:32
  • Your map call is pasted wrong here, and you need an `if __name__ == '__main__':` guard Commented Apr 30, 2020 at 18:35
  • Rest of the code added to the question; not sure what to add, help please Commented Apr 30, 2020 at 18:39
  • p.map is not right: it requires a function and an iterable as arguments. You need to put your requests fetch logic in a function that takes a URL as a parameter. You also need to protect the call to Pool(10) with `if __name__ == '__main__':` so that the child processes won't create their own pools. Commented Apr 30, 2020 at 18:42

2 Answers


You can use the concurrent.futures standard package in Python for both multiprocessing and multi-threading.

In your case, you don't need multiprocessing; multi-threading will help, because your function is I/O-bound (it spends its time waiting on network responses) rather than computationally expensive.

By using multi-threading, you can send multiple requests at the same time. The number_of_threads argument controls how many requests are in flight at once.

I have created a function, extract_data_from_url_func, that extracts the data from a single URL, and I pass this function and the list of URLs to a thread pool executor from concurrent.futures.

from bs4 import BeautifulSoup as soup
from concurrent.futures import ThreadPoolExecutor
import requests

page_url = "url.xml"
number_of_threads = 6
out_filename = "prices.csv"
headers = "availableOffers,otherpricess,currentprice \n"

def extract_data_from_url_func(url):
    print(url)
    response = requests.get(url)
    page_soup = soup(response.text, "html.parser")
    availableOffers = page_soup.find("input", {"id": "availableOffers"})["value"]
    otherpricess = page_soup.find("span", {"class": "price"}).text.replace("$", "")
    currentprice = page_soup.find("div", {"class": "is"}).text.strip().replace("$", "")
    output_list = [availableOffers, otherpricess, currentprice]
    output = ",".join(output_list)
    print(output)
    return output

with open("url.xml", "r") as fr:
    URLS = list(map(lambda x: x.strip(), fr.readlines()))

with ThreadPoolExecutor(max_workers=number_of_threads) as executor:
    results = executor.map(extract_data_from_url_func, URLS)
    responses = list(results)


with open(out_filename, "w") as fw:
    fw.write(headers)
    for response in responses:
        fw.write(response + "\n")

reference: https://docs.python.org/3/library/concurrent.futures.html
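One caveat with executor.map: if any worker raises (a failed request, or a selector that finds nothing), the exception is re-raised when you iterate the results, which can silently abort the whole run. A sketch of a per-URL guard, using a hypothetical `extract` stand-in rather than the real scraping function:

```python
from concurrent.futures import ThreadPoolExecutor

def extract(url):
    # Stand-in for the real scraping logic; raises for one URL
    # to simulate a page where the selector finds nothing.
    if "bad" in url:
        raise ValueError("selector not found")
    return f"data-from-{url}"

def safe_extract(url):
    # Catch per-URL failures so one bad page does not kill the whole run.
    try:
        return extract(url)
    except Exception as exc:
        print(f"skipping {url}: {exc}")
        return None

urls = ["a", "bad-b", "c"]
with ThreadPoolExecutor(max_workers=2) as executor:
    results = [r for r in executor.map(safe_extract, urls) if r is not None]
print(results)  # the failing URL is skipped, the other two succeed
```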


7 Comments

The code runs without errors, but the file (prices.csv) has no values inside it, only the headers availableOffers,otherpricess,currentprice
Could you please print results first, so that we can be sure the function is working fine?
Yes, it prints the URLs on the screen and runs very fast, but (prices.csv) has no values
Sorry about that, I have made some changes to the code; could you please check the new code?
Is print(responses) showing something in the output?

It must be something of this form. Please change your code so that what is passed to p.map is a list of URLs:

from bs4 import BeautifulSoup as soup
import requests
from multiprocessing import Pool
import csv

def parse(url):
    response = requests.get(url)
    page_soup = soup(response.text, "html.parser")
    availableOffers = page_soup.find("input", {"id": "availableOffers"})["value"]
    otherpricess = page_soup.find("span", {"class": "price"}).text.replace("$", "")
    currentprice = page_soup.find("div", {"class": "is"}).text.strip().replace("$", "")
    return availableOffers, otherpricess, currentprice


if __name__ == '__main__':
    urls = [ ... ]  # List of urls to fetch from
    p = Pool(10) # “10” means that 10 URLs will be processed at the same time
    records = p.map(parse, urls)
    p.terminate()
    p.join()
    with open("outfile.csv", "w", newline="") as csvfile:
        writer = csv.writer(csvfile, delimiter=',', quoting=csv.QUOTE_MINIMAL)
        for r in records:
             writer.writerow(r)
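The `urls = [ ... ]` placeholder above needs to be filled from the question's url.xml file. Assuming that file holds one URL per line (as the question's original loop suggests), a small helper could build the list:

```python
def load_urls(path):
    # Read one URL per line, skipping blank lines and
    # stripping surrounding whitespace/newlines.
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]
```

Then `urls = load_urls("url.xml")` inside the `__main__` block would replace the placeholder list.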

3 Comments

line 15: `if __name__ == '__main__'` ^ SyntaxError: invalid syntax
Missing `:`, added now. Are you new to Python?
Then please take this one step at a time instead of leaping directly into process pools. Break the problem into multiple steps and tackle the simplest first.
