
What is the best way to implement multithreading to make a web scrape faster? Would using Pool be a good solution - if so where in my code would I implement it?

import requests
from multiprocessing import Pool

with open('testing.txt', 'w') as outfile:
    results = []
    for number in (4,8,5,7,3,10):
        response = requests.get('https://www.google.com/' + str(number))
        results.append(response.text)
        print(results)

    outfile.write("\n".join(results))
  • Take a look at scrapy which is a great toolkit for scraping. Commented Apr 15, 2018 at 15:14
  • Is it required to write every request to a single file? I think that will cause problems with multithreading. Commented Apr 15, 2018 at 15:15
  • If I don't write to file my interpreter cuts off if the file is too large. Commented Apr 15, 2018 at 15:20

1 Answer


This can be moved to a pool easily. Python comes with both process-based and thread-based pools, and which to use is a tradeoff. Processes are better for parallelizing CPU-bound code but are more expensive when passing results back to the main program. In your case the code spends most of its time waiting on URLs and returns relatively large objects, so a thread pool makes sense.

I moved the code inside an if __name__ == "__main__" guard, which is needed on Windows machines.

import requests
from multiprocessing.pool import ThreadPool

def worker(number):
    response = requests.get('https://www.google.com/' + str(number))
    return response.text

# put some sort of cap on outstanding requests...
MAX_URL_REQUESTS = 10

if __name__ == "__main__":
    numbers = (4,8,5,7,3,10)
    with ThreadPool(min(len(numbers), MAX_URL_REQUESTS)) as pool:
        with open('testing.txt', 'w') as outfile:
            for result in pool.map(worker, numbers, chunksize=1):
                outfile.write(result)
                outfile.write('\n')

4 Comments

Now I understand (...from previous deleted comments)
My bad, I meant min(len(numbers), MAX_URL_REQUESTS). That's the number of threads to use in the pool. I took the minimum of some arbitrary "max number of threads" and the number actually needed for the problem. Fixed in code.
tdelaney - Awesome! Thanks. I'd like to use tqdm or some sort of progress bar. Where would I implement that?
You could update progress in the for result in pool... loop. map returns results in the same order you passed in the parameters. Results would be returned faster (making progress more accurate) using pool.imap_unordered. You could even pass the original number back with the results as a tuple if the association is important.
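The imap_unordered suggestion from the last comment can be sketched as follows. This is a minimal sketch, not part of the original answer: the worker below is a hypothetical stand-in that skips the real requests.get call so the example runs offline, and a manual counter stands in for tqdm (wrapping the iterator as tqdm(pool.imap_unordered(worker, numbers), total=len(numbers)) would give an actual progress bar).

```python
from multiprocessing.pool import ThreadPool

def worker(number):
    # Stand-in for the real requests.get call so the sketch runs offline.
    return "result-" + str(number)

if __name__ == "__main__":
    numbers = (4, 8, 5, 7, 3, 10)
    results = []
    with ThreadPool(len(numbers)) as pool:
        # imap_unordered yields each result as soon as its thread finishes
        # (not in submission order), so progress updates as work completes.
        for done, result in enumerate(pool.imap_unordered(worker, numbers), 1):
            print("%d/%d complete" % (done, len(numbers)))
            results.append(result)
```

Because results arrive out of order, have the worker return (number, text) tuples if you need to know which input each result belongs to.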
