2

I'm building a web scraping app in Python with the Django web framework. I need to scrape multiple queries using the BeautifulSoup library. Here is a snippet of the code I have written:

import requests
from bs4 import BeautifulSoup

for url in websites:
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "html.parser")
    links = soup.find_all("a", {"class":"dev-link"})

Right now the pages are scraped sequentially, and I want to scrape them in parallel. I don't know much about threading in Python. How can I scrape in parallel? Any help would be appreciated.

1
  • How many webpages are you trying to scrape at a time? Commented May 29, 2017 at 15:17

3 Answers

1

Try this solution.

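import threading

import requests
from bs4 import BeautifulSoup

def fetch_links(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "html.parser")
    return soup.find_all("a", {"class": "dev-link"})

results = {}  # url -> list of matching <a> tags

def worker(url):
    # A Thread discards its target's return value, so store it instead.
    results[url] = fetch_links(url)

threads = [threading.Thread(target=worker, args=(url,))
           for url in websites]

for t in threads:
    t.start()

# Wait for every download to finish before reading results.
for t in threads:
    t.join()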

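Downloading web page content via requests.get() is a blocking, I/O-bound operation, so Python threading can actually improve performance here: while one thread waits on the network, the GIL is released and other threads can run.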
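
As an alternative to managing Thread objects by hand, the standard library's concurrent.futures module offers a thread pool that caps the number of simultaneous downloads and hands back each call's return value. The following is only a sketch: scrape_all and its max_workers default are names made up for illustration, and fetch is assumed to be any function that takes a URL and returns its links, such as fetch_links above.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def scrape_all(urls, fetch, max_workers=8):
    """Run fetch(url) for every URL on a thread pool.

    Returns (results, errors): results maps url -> fetch's return value,
    errors maps url -> the exception raised while fetching that URL.
    scrape_all is a hypothetical helper, not part of any library.
    """
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit every URL up front; the pool runs at most max_workers at once.
        future_to_url = {pool.submit(fetch, url): url for url in urls}
        for future in as_completed(future_to_url):
            url = future_to_url[future]
            try:
                results[url] = future.result()
            except Exception as exc:  # one failed page should not stop the rest
                errors[url] = exc
    return results, errors
```

It could then be called as links_by_url, failed = scrape_all(websites, fetch_links). Keeping the pool small is also gentler on the sites being scraped.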

1

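If you want to use multithreading, you can subclass threading.Thread:

import threading
import requests
from bs4 import BeautifulSoup

class Scraper(threading.Thread):
    def __init__(self, threadId, name, url):
        threading.Thread.__init__(self)
        self.name = name
        self.id = threadId
        self.url = url
        self.links = []  # filled in by run()

    def run(self):
        r = requests.get(self.url)
        soup = BeautifulSoup(r.content, 'html.parser')
        # run() cannot return a value to the caller, so keep it on the instance
        self.links = soup.find_all("a")

# list the websites in the list below
websites = []
threads = []
for i, url in enumerate(websites, start=1):
    thread = Scraper(i, "thread" + str(i), url)
    # start() executes run() in a new thread; calling run() directly
    # would execute it in the current thread, one URL at a time
    thread.start()
    threads.append(thread)

for thread in threads:
    thread.join()
    # print(thread.links)

This might be helpful.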
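
One detail worth checking with a Thread subclass is the difference between run() and start(): calling run() is an ordinary method call that executes in the caller's thread, while start() spawns a new thread that then calls run(). The small sketch below uses a hypothetical WhoAmI class (not part of the scraper) that records which thread executed it.

```python
import threading


class WhoAmI(threading.Thread):
    """Hypothetical demo class: remembers the name of the thread that ran it."""

    def run(self):
        self.ran_in = threading.current_thread().name


caller = threading.current_thread().name

direct = WhoAmI(name="worker-direct")
direct.run()      # plain method call: executes in the calling thread

started = WhoAmI(name="worker-started")
started.start()   # spawns a new thread, which then calls run()
started.join()    # wait for it to finish before reading ran_in

print(direct.ran_in == caller)   # run() stayed in the caller's thread
print(started.ran_in)            # name of the thread that start() created
```

So a loop that calls run() on each scraper still downloads the pages one after another; only start() followed by join() overlaps the downloads.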

0

When it comes to Python and scraping, Scrapy is probably the way to go.

Scrapy uses the Twisted networking library for concurrency, so you don't have to worry about threading and the Python GIL.

If you must use BeautifulSoup, check this library out
