I'm scraping articles from a news site on behalf of the site's owner. I have to keep it to at most 5 requests per second, which works out to roughly 100k articles in 6 hours overnight (5 * 3600 * 6 = 108,000), but I'm getting ~30k at best.
It runs in a Jupyter notebook and works fine at first, but it becomes less and less responsive. After 6 hours the kernel is usually uninterruptible and I have to restart it. Since I'm storing each article in memory, this is a problem.
So my question is: is there a more efficient way to do this to reach ~100k articles in 6 hours?
The code is below. For each valid URL in a Pandas dataframe column, the loop:
- downloads the webpage
- extracts the relevant text
- cleans out some encoding garbage from the text
- writes that text to another dataframe column
- every 2000 articles, it saves the dataframe to a CSV (overwriting the last backup), to handle the eventual crash of the script.
Some ideas I've considered:
- write each article to a local SQL server instead of keeping everything in memory (speed concerns? see the first sketch after this list)
- save each article's text to a CSV along with its URL as I go, then build the dataframe later (second sketch below)
- delete all print() calls and rely solely on logging (my logger config doesn't seem to perform well, though; I'm not sure it's capturing everything I tell it to log)
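For the first idea, this is roughly what I have in mind, using the sqlite3 module from the standard library; the file, table, and column names are just placeholders:

import sqlite3

conn = sqlite3.connect('articles.db')  # placeholder local database file
conn.execute('CREATE TABLE IF NOT EXISTS articles (url TEXT PRIMARY KEY, article_text TEXT)')

def save_article(url, txt):
    # insert or update the row for this URL and flush it to disk immediately
    conn.execute('INSERT OR REPLACE INTO articles (url, article_text) VALUES (?, ?)', (url, txt))
    conn.commit()

Inside the loop I'd call save_article(u, txt) instead of writing into unique_urls, then rebuild the dataframe afterwards with pd.read_sql('SELECT * FROM articles', conn).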
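For the second idea, I'd append one row per article to a flat file as it's scraped; again just a sketch, and the filename is a placeholder:

import csv

def append_article(url, txt, path='article-texts.csv'):
    # append a single (url, text) row; whatever is written survives a dead kernel
    with open(path, 'a') as f:
        # .encode() keeps Python 2's csv module happy with unicode text
        csv.writer(f).writerow([url.encode('utf-8'), txt.encode('utf-8')])

At the end, pd.read_csv('article-texts.csv', names=['url', 'article_text']) should rebuild the column.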
import logging
import re
import time

import pandas as pd
import requests

i = 0
# lots of NaNs in the column, hence the subsetting
for u in unique_urls[unique_urls['unique_suffixes'].isnull() == False]\
        .unique_suffixes.values[:]:
    i = i + 1
    if pd.isnull(u):
        continue

    # save our progress every 2k articles just in case
    if i % 2000 == 0:
        unique_urls.to_csv('/backup-article-txt.csv', encoding='utf-8')

    try:
        # pull the data
        html_r = requests.get(u).text

        # the phrase "TX:" indicates the start of the article text,
        # so if it's not present, the URL must have been bad
        if html_r.find("TX:") == -1:
            continue

        # capture just the text of the article
        txt = html_r[html_r.find("TX:") + 5:]

        # fix encoding/formatting quirks
        txt = txt.replace('\n', ' ')
        txt = re.sub(r'[^\x00-\x7F]', '', txt)  # strip non-ASCII characters

        # wait 200 ms to spare the site's servers
        time.sleep(.2)

        # write our article to our dataframe
        unique_urls.loc[unique_urls.unique_suffixes == u, 'article_text'] = txt

        logging.info("done with url # %s -- %s remaining", i, (total_links - i))
        print "done with url # " + str(i)
        print total_links - i
    except:
        logging.exception("Exception on article # %s, URL: %s", i, u)
        print "ERROR with url # " + str(i)
        continue
This is the logging config I'm using. I found it on SO, but with this particular script it doesn't seem to capture everything.
logTime = "{:%d %b-%X}".format(datetime.datetime.now())
logger = logging.getLogger()
fhandler = logging.FileHandler(filename=logTime + '.log', mode='a')
formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
fhandler.setFormatter(formatter)
logger.addHandler(fhandler)
logger.setLevel(logging.INFO)
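For comparison, I think the same setup collapses to a single basicConfig call, though I haven't verified it behaves any differently inside the notebook:

import datetime
import logging

logTime = "{:%d %b-%X}".format(datetime.datetime.now())

# append to a timestamped log file at INFO level, same format string as above
logging.basicConfig(
    filename=logTime + '.log',
    filemode='a',
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
    level=logging.INFO,
)

Per the docs, basicConfig is a no-op if the root logger already has handlers, so it would have to run before anything else attaches one.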
ETA: some details in response to answers/comments:
- the script is the only thing running on an EC2 instance with 16 GB of RAM
- articles are ~100-800 words apiece