
I'm trying to download a long list of HTML files from the internet to my computer, and then use BeautifulSoup to scrape those files locally. It's a long story why I want to save them to my computer first before scraping, so I'll spare you the essay!

Anyway, for me the requests module is too slow when dealing with many URLs, so I decided to stick with urllib and use multiprocessing/thread pooling to make the request functions run in parallel (which is quicker than requesting each file one after another).

My problem is that I want to save each HTML/URL independently; that is, I want to store each HTML file separately instead of writing all of the HTML into one file. While multiprocessing and urllib can request the pages in parallel, I couldn't figure out how to download (or save/write to a text file) each page separately.

I'm imagining something like the general example I just made up below, where each request within the parallel function will get performed in parallel.

parallel(
    request1
    request2
    request3
    ...
)
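
To make that concrete, what I'm imagining is roughly the sketch below (I made it up; the URLs and filenames are placeholders), where each request function saves its page to its own file and they all run through a pool:

from multiprocessing import Pool
from urllib.request import urlopen

def request1():
    # fetch one URL and write it to its own file (placeholder URL/filename)
    html = urlopen('http://example.com/page1').read()
    with open('page1.html', 'wb') as f:
        f.write(html)

def request2():
    html = urlopen('http://example.com/page2').read()
    with open('page2.html', 'wb') as f:
        f.write(html)

def call(func):
    # tiny helper so the pool can run each request function as a task
    return func()

if __name__ == '__main__':
    with Pool() as pool:
        pool.map(call, [request1, request2])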

The reason for wanting it like this is so that I can use the same simple script structure for the next step: parsing the HTML with BeautifulSoup. Just as I have separate request functions for each URL in the first part, I'll need separate parse functions for each HTML file, because each one's structure is different. If you have a different solution, that's okay too; I'm just trying to explain my thinking, and it doesn't have to be done this way.
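
For the parsing step, the rough structure I have in mind is something like this (just a sketch; the site names and selectors are placeholders, since each real site will need its own BeautifulSoup logic):

from bs4 import BeautifulSoup

def parse_site_a(html):
    # site A keeps the data I want in a table (placeholder logic)
    soup = BeautifulSoup(html, 'html.parser')
    return soup.find('table')

def parse_site_b(html):
    # site B keeps it in a div with a known id (placeholder logic)
    soup = BeautifulSoup(html, 'html.parser')
    return soup.find('div', id='content')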

Is it possible to do this (both requesting and parsing separately) using multiprocessing (or any other library)? I spent all day yesterday on Stack Overflow looking for similar questions, but many involve complex tools like eventlet or scrapy, and none mention downloading each HTML page into a separate file and parsing the files individually, but in parallel.

1 Answer


It's possible for sure (: Just write a single-threaded function that does everything you need from start to finish, and then execute it in a multiprocessing pool, e.g.:

from multiprocessing import Pool
from urllib.request import urlopen
from urllib.parse import quote

def my_function(url_to_parse):
    # request: download the page in this worker
    html = urlopen(url_to_parse).read()
    # parse: run whatever BeautifulSoup logic you need on `html` here
    # save: write the raw page to its own file, named after its URL
    filename = quote(url_to_parse, safe='') + '.html'
    with open(filename, 'wb') as f:
        f.write(html)
    return filename  # optional

NUM_OF_PROCS = 10
pool = Pool(NUM_OF_PROCS)
pool.map(my_function, list_of_urls_to_parse)  # pass the list itself, not [list_of_urls_to_parse]
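
On Windows (and on macOS since Python 3.8), the pool has to be created under an if __name__ == '__main__': guard so the worker processes can re-import the module safely; a minimal usage sketch with placeholder URLs:

if __name__ == '__main__':
    list_of_urls_to_parse = [
        'http://example.com/page1',  # placeholder URLs
        'http://example.com/page2',
    ]
    with Pool(NUM_OF_PROCS) as pool:
        saved_files = pool.map(my_function, list_of_urls_to_parse)
    print(saved_files)  # results come back in the same order as the input URLs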

2 Comments

Hey, thanks so much! This works great for the requests, and speeds things up! However, regarding the parsing part, it doesn't work very well because each URL/document has a different structure, so one parsing function doesn't work for all. If I add more parsing functions, it will try each parsing function on every URL, making it unnecessarily slow. If you have any ideas for overcoming this, let me know! I can post another question and you can answer it for the credit! :)
It depends entirely on what you're planning to parse. If you have a fixed number of HTML patterns, you can just add more conditions to the parsing part. If you're extracting data with some kind of regular format, you can use regular expressions, etc.
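
To illustrate that last suggestion: instead of trying every parser on every page, a dispatch table keyed by domain runs only the matching one (a sketch; PARSERS, parse_site_a and parse_site_b are hypothetical names):

from urllib.parse import urlparse

# hypothetical dispatch table: domain -> the parser written for that site
PARSERS = {
    'site-a.com': parse_site_a,
    'site-b.com': parse_site_b,
}

def parse_one(url, html):
    # run only the parser registered for this URL's domain
    return PARSERS[urlparse(url).netloc](html)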
