
I'm scraping articles from a news site on behalf of the owner. I have to keep it to <= 5 requests per second, which works out to roughly 100k articles in 6 hrs (overnight), but I'm getting ~30k at best.

I'm running this in a Jupyter notebook. It runs fine at first, but becomes less and less responsive; after 6 hrs the kernel is usually uninterruptible and I have to restart it. Since I'm storing every article in memory, restarting the kernel is a problem.

So my question is: is there a more efficient way to do this to reach ~100k articles in 6 hours?

The code is below. For each valid URL in a Pandas dataframe column, the loop:

  1. downloads the webpage
  2. extracts the relevant text
  3. cleans out some encoding garbage from the text
  4. writes that text to another dataframe column
  5. every 2000 articles, it saves the dataframe to a CSV (overwriting the last backup), to handle the eventual crash of the script.

Some ideas I've considered:

  1. Write each article to a local SQL server instead of keeping it all in memory (speed concerns?)
  2. Save each article's text to a CSV alongside its URL, then build the dataframe later (rough sketch just below)
  3. Delete all the print() calls and rely solely on logging (my logger config doesn't seem to perform great, though--I'm not sure it's capturing everything I tell it to)
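
For idea 2, a rough sketch of what I have in mind (assuming u and txt are the URL and cleaned article text from the loop; the filename is just a placeholder):

    import csv

    # Append each article to disk as soon as it's cleaned, instead of
    # holding everything in the dataframe until the end.
    with open('article-texts.csv', 'ab') as f:  # 'ab' so successive runs append
        writer = csv.writer(f)
        # Python 2's csv module wants byte strings, so encode to UTF-8 first
        writer.writerow([u.encode('utf-8'), txt.encode('utf-8')])

And here is the current loop: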

    import re
    import time
    import logging
    import requests
    import pandas as pd

    i = 0
    # lots of NaNs in the column, hence the subsetting
    for u in unique_urls[unique_urls['unique_suffixes'].notnull()]\
            .unique_suffixes.values[:]:

        i = i + 1

        if pd.isnull(u):
            continue

        # save our progress every 2k articles just in case
        if i % 2000 == 0:
            unique_urls.to_csv('/backup-article-txt.csv', encoding='utf-8')

        try:
            # pull the data
            html_r = requests.get(u).text

            # the phrase "TX:" indicates the start of the article
            # text, so if it's not present the URL must have been bad
            marker = html_r.find("TX:")
            if marker == -1:
                continue

            # capture just the text of the article
            txt = html_r[marker + 5:]

            # fix encoding/formatting quirks; note that str.replace() treats
            # its argument literally, so stripping non-ASCII needs re.sub()
            txt = txt.replace('\n', ' ')
            txt = re.sub(r'[^\x00-\x7F]', '', txt)

            # wait 200 ms to spare the site's servers
            time.sleep(.2)

            # write our article to our dataframe
            unique_urls.loc[unique_urls.unique_suffixes == u, 'article_text'] = txt

            logging.info("done with url # %s -- %s remaining", i, (total_links - i))
            print "done with url # " + str(i)
            print total_links - i

        except Exception:  # a bare except would also swallow KeyboardInterrupt
            logging.exception("Exception on article # %s, URL: %s", i, u)
            print "ERROR with url # " + str(i)
            continue
    

This is the logging config I'm using. I found it on SO, but with this particular script it doesn't seem to capture everything.

    import datetime
    import logging

    logTime = "{:%d %b-%X}".format(datetime.datetime.now())
    logger = logging.getLogger()
    fhandler = logging.FileHandler(filename=logTime + '.log', mode='a')
    formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
    fhandler.setFormatter(formatter)
    logger.addHandler(fhandler)
    logger.setLevel(logging.INFO)

ETA: some details in response to answers/comments:

  • the script is the only thing running on an EC2 instance with 16 GB of RAM

  • articles are ~100-800 words apiece

1 Answer


I'm going to take an educated guess and say that, based on your description, your script turns your machine into a swap storm by the time you get to around 30k articles. I don't see anything in your code where you could easily free up memory using:

    some_large_container = None

Setting something that you know has a large allocation to None tells Python's memory manager that it's available for garbage collection. You also might want to explicitly call gc.collect(), but I'm not sure that would do you much good.
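
For instance, a minimal sketch of that idea inside your loop (html_r and txt are the names from your code; the gc.collect() call is optional, as noted above):

    import gc

    # once the article text has been written somewhere persistent, drop the
    # big intermediate strings so their memory can be reclaimed
    html_r = None
    txt = None

    # optionally nudge the collector; CPython usually frees these on its own
    # once the references are gone, so this may not buy you much
    gc.collect()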

Alternatives you could consider:

  • sqlite3: Instead of a remote SQL server, use sqlite3 as intermediate storage; Python ships with a module for it in the standard library.
  • Keep appending rows to the CSV checkpoint file rather than rewriting the whole thing each time.
  • Compress your strings with zlib.compress().
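
Here's a rough sketch of how the first and third of those could fit together (the table name and filename are made up for illustration, not taken from your code):

    import sqlite3
    import zlib

    # one-time setup: a local, file-backed database as intermediate storage
    conn = sqlite3.connect('articles.db')
    conn.execute("CREATE TABLE IF NOT EXISTS articles (url TEXT PRIMARY KEY, body BLOB)")

    def save_article(url, text):
        # compress the cleaned article text before storing it
        blob = sqlite3.Binary(zlib.compress(text.encode('utf-8')))
        conn.execute("INSERT OR REPLACE INTO articles (url, body) VALUES (?, ?)",
                     (url, blob))
        conn.commit()

    # inside the scraping loop, instead of assigning into unique_urls:
    #     save_article(u, txt)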

Whichever way you decide to go, you're probably best off doing the collection as phase 1 and constructing the Pandas dataframe as phase 2. It never pays to be too clever by half; the other half tends to hang you.
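
For phase 2, rebuilding the dataframe from the sqlite3 store sketched above could look something like this (again, the table and column names are illustrative):

    import sqlite3
    import zlib
    import pandas as pd

    conn = sqlite3.connect('articles.db')
    rows = conn.execute("SELECT url, body FROM articles").fetchall()

    # decompress each stored blob back into text and build the frame in one go
    records = [(url, zlib.decompress(bytes(body)).decode('utf-8'))
               for url, body in rows]
    articles = pd.DataFrame(records, columns=['url', 'article_text'])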


4 Comments

Thanks @ScottM. Re: swap, the script runs on a 16 GB AWS EC2 instance and it's the only thing going -- I haven't kept track of memory usage, but is it likely that 30k articles (~100-800 words apiece) would spill into swap? Either way, appending to some kind of persistent on-disk storage seems like it'd be more efficient than what I'm doing currently.
Just an educated guess, given that the system becomes totally unresponsive. Sure, 16 GB is a lot of RAM for a virtual machine, but that doesn't mean the underlying AWS hardware node is actually giving you 16 GB -- hypervisors tend to overcommit VM RAM, since in the average case a VM doesn't use all of it ("memory ballooning", if memory serves me correctly).
Also check your process limits (ulimit at a bash or zsh prompt). That could also limit, no pun intended, your total memory usage.
Thanks Scott. For posterity: I took your advice and stored article results in a list of tuples (url, text, index #). Then, every 1000 iterations, I converted the list of tuples to a DataFrame and wrote a new CSV with a unique filename to merge later (that felt more reliable than appending). I tried cutting out the dataframe altogether, but the Python csv module wasn't playing nicely with my string encodings.
