
What is the best way to implement multithreading to make a web scrape faster? Would using Pool be a good solution - if so where in my code would I implement it?

import requests
from multiprocessing import Pool

with open('testing.txt', 'w') as outfile:
    results = []
    for number in (4,8,5,7,3,10):
        response = requests.get('https://www.google.com/' + str(number))
        results.append(response.text)
        print(results)

    outfile.write("\n".join(results))
  • Take a look at scrapy which is a great toolkit for scraping. Commented Apr 15, 2018 at 15:14
  • Is it required to write every request to a single file? I think that will cause problems with multithreading. Commented Apr 15, 2018 at 15:15
  • If I don't write to file my interpreter cuts off if the file is too large. Commented Apr 15, 2018 at 15:20

1 Answer


This can be moved to a pool easily. Python comes with both process-based and thread-based pools, and which to use is a tradeoff. Processes are better for parallelizing CPU-bound code but are more expensive when passing results back to the main program. In your case the code spends most of its time waiting on URLs and returns relatively large objects, so a thread pool makes sense.

I moved the code inside an if __name__ == "__main__" guard, which is needed on Windows machines.

import requests
from multiprocessing.pool import ThreadPool

def worker(number):
    response = requests.get('https://www.google.com/' + str(number))
    return response.text

# put some sort of cap on outstanding requests...
MAX_URL_REQUESTS = 10

if __name__ == "__main__":
    numbers = (4,8,5,7,3,10)
    with ThreadPool(min(len(numbers), MAX_URL_REQUESTS)) as pool:
        with open('testing.txt', 'w') as outfile:
            for result in pool.map(worker, numbers, chunksize=1):
                outfile.write(result)
                outfile.write('\n')

4 Comments

Now I understand (...from previous deleted comments)
My bad, I meant min(len(numbers), MAX_URL_REQUESTS). That's the number of threads to use in the pool. I took the minimum of some arbitrary "max number of threads" and the number actually needed for the problem. Fixed in code.
tdelaney - Awesome! Thanks. I'd like to use tqdm or some sort of progress bar. Where would I implement that?
You could update progress in the for result in pool... loop. map returns results in the same order you passed in the parameters. Results would be returned faster (making progress more accurate) using pool.imap_unordered. You could even pass the original number back with the results as a tuple if the association is important.
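The imap_unordered suggestion from the last comment can be sketched as follows. This is a minimal sketch, not part of the original answer: the worker below is a hypothetical stand-in that skips the real requests.get call so the example runs offline, and a manual counter stands in for tqdm (wrapping the iterator as tqdm(pool.imap_unordered(worker, numbers), total=len(numbers)) would give an actual progress bar).

```python
from multiprocessing.pool import ThreadPool

def worker(number):
    # Stand-in for the real requests.get call so the sketch runs offline.
    return "result-" + str(number)

if __name__ == "__main__":
    numbers = (4, 8, 5, 7, 3, 10)
    results = []
    with ThreadPool(len(numbers)) as pool:
        # imap_unordered yields each result as soon as its thread finishes
        # (not in submission order), so progress updates as work completes.
        for done, result in enumerate(pool.imap_unordered(worker, numbers), 1):
            print("%d/%d complete" % (done, len(numbers)))
            results.append(result)
```

Because results arrive out of order, have the worker return (number, text) tuples if you need to know which input each result belongs to.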
