
I'm trying to download a long list of HTML files from the internet to my computer, and then use BeautifulSoup to scrape those files locally. It's a long story why I want to save them to my computer first before scraping, so I'll spare you the essay!

Anyway, for me the requests module is too slow when dealing with many URLs, so I decided to stick with urllib and use multiprocessing/thread pooling to make the request functions run in parallel (which is quicker than requesting each file one after another).

My problem is that I want to save each HTML/URL independently; that is, I want to store each HTML file separately instead of writing all of the HTML into one file. While multiprocessing and urllib can request the pages in parallel, I couldn't figure out how to download (or save/write to a text file) each page separately.

I'm imagining something like the general example I just made up below, where each request within the parallel function will get performed in parallel.

parallel(
    request1
    request2
    request3
    ...
)
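
To make that concrete, what I'm imagining is roughly the sketch below (I made it up; the URLs and filenames are placeholders), where each request function saves its page to its own file and they all run through a pool:

from multiprocessing import Pool
from urllib.request import urlopen

def request1():
    # fetch one URL and write it to its own file (placeholder URL/filename)
    html = urlopen('http://example.com/page1').read()
    with open('page1.html', 'wb') as f:
        f.write(html)

def request2():
    html = urlopen('http://example.com/page2').read()
    with open('page2.html', 'wb') as f:
        f.write(html)

def call(func):
    # tiny helper so the pool can run each request function as a task
    return func()

if __name__ == '__main__':
    with Pool() as pool:
        pool.map(call, [request1, request2])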

The reason for wanting it like this is so that I can use the same simple script structure for the next step: parsing the HTML with BeautifulSoup. Just as I have separate request functions for each URL in the first part, I'll need separate parse functions for each HTML file, because each one's structure is different. If you have a different solution, that's okay too; I'm just trying to explain my thinking, and it doesn't have to be done this way.
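
For the parsing step, the rough structure I have in mind is something like this (just a sketch; the site names and selectors are placeholders, since each real site will need its own BeautifulSoup logic):

from bs4 import BeautifulSoup

def parse_site_a(html):
    # site A keeps the data I want in a table (placeholder logic)
    soup = BeautifulSoup(html, 'html.parser')
    return soup.find('table')

def parse_site_b(html):
    # site B keeps it in a div with a known id (placeholder logic)
    soup = BeautifulSoup(html, 'html.parser')
    return soup.find('div', id='content')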

Is it possible to do this (both requesting and parsing separately) using multiprocessing (or any other library)? I spent all day yesterday on Stack Overflow looking for similar questions, but many involve complex tools like eventlet or scrapy, and none mention downloading each HTML page into a separate file and parsing the files individually, but in parallel.

1 Answer


It's possible for sure (: Just write a single-threaded function that does everything you need from start to finish, and then execute it in a multiprocessing pool, e.g.:

from multiprocessing import Pool
from urllib.request import urlopen
from urllib.parse import quote

def my_function(url_to_parse):
    # request: download the page in this worker
    html = urlopen(url_to_parse).read()
    # parse: run whatever BeautifulSoup logic you need on `html` here
    # save: write the raw page to its own file, named after its URL
    filename = quote(url_to_parse, safe='') + '.html'
    with open(filename, 'wb') as f:
        f.write(html)
    return filename  # optional

NUM_OF_PROCS = 10
pool = Pool(NUM_OF_PROCS)
pool.map(my_function, list_of_urls_to_parse)  # pass the list itself, not [list_of_urls_to_parse]
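
On Windows (and on macOS since Python 3.8), the pool has to be created under an if __name__ == '__main__': guard so the worker processes can re-import the module safely; a minimal usage sketch with placeholder URLs:

if __name__ == '__main__':
    list_of_urls_to_parse = [
        'http://example.com/page1',  # placeholder URLs
        'http://example.com/page2',
    ]
    with Pool(NUM_OF_PROCS) as pool:
        saved_files = pool.map(my_function, list_of_urls_to_parse)
    print(saved_files)  # results come back in the same order as the input URLs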

2 Comments

Hey, thanks so much! This works great for the requests, and speeds things up! However, regarding the parsing part, it doesn't work very well because each URL/document has a different structure, so one parsing function doesn't work for all. If I add more parsing functions, it will try each parsing function on every URL, making it unnecessarily slow. If you have any ideas for overcoming this, let me know! I can post another question and you can answer it for the credit! :)
It depends entirely on what you're planning to parse. If you have a fixed number of HTML patterns, you can just add more conditions to the parsing part. If you're extracting data with some kind of regular format, you can use regular expressions, etc.
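
To illustrate that last suggestion: instead of trying every parser on every page, a dispatch table keyed by domain runs only the matching one (a sketch; PARSERS, parse_site_a and parse_site_b are hypothetical names):

from urllib.parse import urlparse

# hypothetical dispatch table: domain -> the parser written for that site
PARSERS = {
    'site-a.com': parse_site_a,
    'site-b.com': parse_site_b,
}

def parse_one(url, html):
    # run only the parser registered for this URL's domain
    return PARSERS[urlparse(url).netloc](html)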
