
I'm scraping articles from a news site on behalf of the owner. I have to keep it to <= 5 requests per second, which works out to roughly 100k articles in 6 hrs (overnight), but I'm getting ~30k at best.

I'm running this in a Jupyter notebook. It runs fine at first, but becomes less and less responsive; after 6 hrs the kernel is usually uninterruptible and I have to restart it. Since I'm storing every article in memory, restarting the kernel is a problem.

So my question is: is there a more efficient way to do this to reach ~100k articles in 6 hours?

The code is below. For each valid URL in a Pandas dataframe column, the loop:

  1. downloads the webpage
  2. extracts the relevant text
  3. cleans out some encoding garbage from the text
  4. writes that text to another dataframe column
  5. every 2000 articles, it saves the dataframe to a CSV (overwriting the last backup), to handle the eventual crash of the script.

Some ideas I've considered:

  1. Write each article to a local SQL server instead of keeping it all in memory (speed concerns?)
  2. Save each article's text to a CSV alongside its URL, then build the dataframe later (rough sketch just below)
  3. Delete all the print() calls and rely solely on logging (my logger config doesn't seem to perform great, though--I'm not sure it's capturing everything I tell it to)
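
For idea 2, a rough sketch of what I have in mind (assuming u and txt are the URL and cleaned article text from the loop; the filename is just a placeholder):

    import csv

    # Append each article to disk as soon as it's cleaned, instead of
    # holding everything in the dataframe until the end.
    with open('article-texts.csv', 'ab') as f:  # 'ab' so successive runs append
        writer = csv.writer(f)
        # Python 2's csv module wants byte strings, so encode to UTF-8 first
        writer.writerow([u.encode('utf-8'), txt.encode('utf-8')])

And here is the current loop: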

    import re
    import time
    import logging
    import requests
    import pandas as pd

    i = 0
    # lots of NaNs in the column, hence the subsetting
    for u in unique_urls[unique_urls['unique_suffixes'].notnull()]\
            .unique_suffixes.values[:]:

        i = i + 1

        if pd.isnull(u):
            continue

        # save our progress every 2k articles just in case
        if i % 2000 == 0:
            unique_urls.to_csv('/backup-article-txt.csv', encoding='utf-8')

        try:
            # pull the data
            html_r = requests.get(u).text

            # the phrase "TX:" indicates the start of the article
            # text, so if it's not present the URL must have been bad
            marker = html_r.find("TX:")
            if marker == -1:
                continue

            # capture just the text of the article
            txt = html_r[marker + 5:]

            # fix encoding/formatting quirks; note that str.replace() treats
            # its argument literally, so stripping non-ASCII needs re.sub()
            txt = txt.replace('\n', ' ')
            txt = re.sub(r'[^\x00-\x7F]', '', txt)

            # wait 200 ms to spare the site's servers
            time.sleep(.2)

            # write our article to our dataframe
            unique_urls.loc[unique_urls.unique_suffixes == u, 'article_text'] = txt

            logging.info("done with url # %s -- %s remaining", i, (total_links - i))
            print "done with url # " + str(i)
            print total_links - i

        except Exception:  # a bare except would also swallow KeyboardInterrupt
            logging.exception("Exception on article # %s, URL: %s", i, u)
            print "ERROR with url # " + str(i)
            continue
    

This is the logging config I'm using. I found it on SO, but with this particular script it doesn't seem to capture everything.

    import datetime
    import logging

    logTime = "{:%d %b-%X}".format(datetime.datetime.now())
    logger = logging.getLogger()
    fhandler = logging.FileHandler(filename=logTime + '.log', mode='a')
    formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
    fhandler.setFormatter(formatter)
    logger.addHandler(fhandler)
    logger.setLevel(logging.INFO)

ETA: some details in response to answers/comments:

  • the script is the only thing running on an EC2 instance with 16 GB of RAM

  • articles are ~100-800 words apiece

1 Answer


I'm going to take an educated guess and say that, based on your description, your script turns your machine into a swap storm by the time you get to around 30k articles. I don't see anything in your code where you could easily free up memory using:

    some_large_container = None

Setting something that you know has a large allocation to None tells Python's memory manager that it's available for garbage collection. You also might want to explicitly call gc.collect(), but I'm not sure that would do you much good.
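
For instance, a minimal sketch of that idea inside your loop (html_r and txt are the names from your code; the gc.collect() call is optional, as noted above):

    import gc

    # once the article text has been written somewhere persistent, drop the
    # big intermediate strings so their memory can be reclaimed
    html_r = None
    txt = None

    # optionally nudge the collector; CPython usually frees these on its own
    # once the references are gone, so this may not buy you much
    gc.collect()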

Alternatives you could consider:

  • sqlite3: Instead of a remote SQL server, use sqlite3 as intermediate storage; Python ships with a module for it in the standard library.
  • Keep appending rows to the CSV checkpoint file rather than rewriting the whole thing each time.
  • Compress your strings with zlib.compress().
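
Here's a rough sketch of how the first and third of those could fit together (the table name and filename are made up for illustration, not taken from your code):

    import sqlite3
    import zlib

    # one-time setup: a local, file-backed database as intermediate storage
    conn = sqlite3.connect('articles.db')
    conn.execute("CREATE TABLE IF NOT EXISTS articles (url TEXT PRIMARY KEY, body BLOB)")

    def save_article(url, text):
        # compress the cleaned article text before storing it
        blob = sqlite3.Binary(zlib.compress(text.encode('utf-8')))
        conn.execute("INSERT OR REPLACE INTO articles (url, body) VALUES (?, ?)",
                     (url, blob))
        conn.commit()

    # inside the scraping loop, instead of assigning into unique_urls:
    #     save_article(u, txt)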

Whichever way you decide to go, you're probably best off doing the collection as phase 1 and constructing the Pandas dataframe as phase 2. It never pays to be too clever by half; the other half tends to hang you.
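
For phase 2, rebuilding the dataframe from the sqlite3 store sketched above could look something like this (again, the table and column names are illustrative):

    import sqlite3
    import zlib
    import pandas as pd

    conn = sqlite3.connect('articles.db')
    rows = conn.execute("SELECT url, body FROM articles").fetchall()

    # decompress each stored blob back into text and build the frame in one go
    records = [(url, zlib.decompress(bytes(body)).decode('utf-8'))
               for url, body in rows]
    articles = pd.DataFrame(records, columns=['url', 'article_text'])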


4 Comments

Thanks @ScottM. Re: swap, the script runs on a 16 GB AWS EC2 instance and it's the only thing going -- I haven't kept track of memory usage, but is it likely that 30k articles (~100-800 words apiece) would spill into swap? Either way, appending to some kind of persistent on-disk storage seems like it'd be more efficient than what I'm doing currently.
Just an educated guess, given that the system becomes totally unresponsive. Sure, 16 GB is a lot of RAM for a virtual machine, but that doesn't mean the underlying AWS hardware node is actually giving you 16 GB -- hypervisors tend to overcommit VM RAM, since in the average case a VM doesn't use all of it ("memory ballooning", if memory serves me correctly).
Also check your process limits (ulimit at a bash or zsh prompt). That could also limit, no pun intended, your total memory usage.
Thanks Scott. For posterity: I took your advice and stored article results in a list of tuples (url, text, index #). Then, every 1000 iterations, I converted the list of tuples to a DataFrame and wrote a new CSV with a unique filename to merge later (that felt more reliable than appending). I tried cutting out the dataframe altogether, but the Python csv module wasn't playing nicely with my string encodings.
