Multithread python request in webscraping

Question

So I have this code (from BeautifulSoup : how to show the inside of a div that won't show?), but I was wondering if any of you knew how to speed up the result process ?

It takes the glossary entries of a website and creates text file with them, but because I will do the same with several websites in several languages, it is a bit too slow at the moment.

If anyone has any idea or insights, I'll be happy to read them !

import requests
import json

r = requests.get('http://winevibe.com/wp-json/glossary/key/?l=en').json()
data = json.loads(r)
result = ([(item['key'], item['id']) for item in data])
text = []
for item in result:
    try:
        r = requests.get(
            f"http://winevibe.com/wp-json/glossary/text/?id={item[1]}").json()
        data = json.loads(r)
        print(f"Getting Text For: {item[0]}")
        text.append(data[0]['text'])
    except KeyboardInterrupt:
        print('Good Bye')
        break

with open('result.txt', 'w+') as f:
    for a, b in zip(result, text):
        lines = ', '.join([a[0], b.replace('\n', '')]) + '\n'
        f.write(lines)

you can use Threads, Multipocesses or asyncio. requests should have extensions with all these methods - ie. see requests_toolbelt for requests + Threads, or aiohttp-requests for requests with asyncio — furas
– furas, Commented Nov 28, 2019 at 20:06
You can also go with scrapy. The learning curve is a bit hard but its super fast. — Prithvi Singh
– Prithvi Singh, Commented Nov 29, 2019 at 5:44

mradey · Accepted Answer · 2019-11-29 06:22:01Z

If you are looking for an easy way to speed performance without a lot of overhead, the threading library is simple to get started with. Here is an example (though not very practical) of the use:

import threading
import requests as r 

#store request into list.  Do any altering to response here
def get_url_content(url,idx,results):
  results[idx] = str(r.get(url).content)

urls = ['https://tasty.co/compilation/10-supreme-lasagna-recipes' for i in range(1000)]
results = [None for ele in urls]
num_threads = 20
start = time.time()
threads = []
i = 0
while len(urls) > 0:
  if len(urls) > num_threads:
    url_sub_li = urls[:num_threads]
    urls = urls[num_threads:]
  else:
    url_sub_li = urls
    urls = []

  #create a thread for each url to scrape
  for url in url_sub_li:
    t = threading.Thread(target=get_url_content, args=(url,i,results))
    threads.append(t)
    i+=1
  #start each thread
  for t in threads:
    t.start()
  #wait for each thread to finish
  for t in threads:
    t.join()
  threads = []

when num_threads was varied from 5 to 95 incrementing by 5 these were the results:

5 threads took 15.603618860244751 seconds
10 threads took 12.467495679855347 seconds
15 threads took 12.416464805603027 seconds
20 threads took 12.120754957199097 seconds
25 threads took 11.872958421707153 seconds
30 threads took 11.743015766143799 seconds
35 threads took 11.87484860420227 seconds
40 threads took 11.65029239654541 seconds
45 threads took 11.6738121509552 seconds
50 threads took 11.400196313858032 seconds
55 threads took 11.399579286575317 seconds
60 threads took 11.302385807037354 seconds
65 threads took 11.301892280578613 seconds
70 threads took 11.088538885116577 seconds
75 threads took 11.60099172592163 seconds
80 threads took 11.280904531478882 seconds
85 threads took 11.361995935440063 seconds
90 threads took 11.376339435577393 seconds
95 threads took 11.090314388275146 seconds

If the same program was run in serial:

urls = ['https://tasty.co/compilation/10-supreme-lasagna-recipes' for i in range(1000)]
results = [str(r.get(url).content) for url in urls]

the time was: 51.39667201042175 seconds

The Threading library here improves performance by ~5 times and is very simple to integrate with your code. As mentioned in comments there are other libraries, which could provide even better performance, but this is a very useful one for simple integration.

Collectives™ on Stack Overflow

Multithread python request in webscraping

1 Answer 1

Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related