
I'm trying to read a large number of rows from a database (over 100,000) and write them to a CSV file on an Ubuntu VPS. The server doesn't have enough memory for this.

I was thinking about reading 5,000 rows at a time and writing them to the file, then reading another 5,000, and so on.

How should I restructure my current code so that it doesn't consume all the memory?

Here's my code:

def write_rows(emails)
  File.open(file_path, "w+") do |f|
    f << "email,name,ip,created\n"
    emails.each do |l|
      f << [l.email, l.name, l.ip, l.created_at].join(",") + "\n"
    end
  end
end

The method is called from a Sidekiq worker with:

write_rows(user.emails)

Thanks for the help!

1 Answer


The problem here is that when you call emails.each, ActiveRecord loads all the records from the database and keeps them in memory. To avoid this, use the method find_each:

require 'csv'

def write_rows(emails)
  CSV.open(file_path, 'w') do |csv|
    csv << %w{email name ip created}

    # find_each loads the records in batches instead of all at once
    emails.find_each do |email|
      csv << [email.email, email.name, email.ip, email.created_at]
    end
  end
end

By default, find_each loads records in batches of 1,000 at a time. If you want to load batches of 5,000 records, you have to pass the :batch_size option to find_each:

emails.find_each(:batch_size => 5000) do |email|
  ...

More information about find_each (and the related find_in_batches) can be found in the Ruby on Rails Guides.
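For comparison, here is a minimal sketch of find_in_batches under the same assumptions as the code above (the emails relation and the csv handle come from the example); it yields each batch as an array of records rather than yielding records one at a time:

emails.find_in_batches(batch_size: 5000) do |batch|
  # batch is an Array of up to 5000 records
  batch.each do |email|
    csv << [email.email, email.name, email.ip, email.created_at]
  end
end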

I've used the CSV class to write the file instead of joining fields and lines by hand. This is not intended as a performance optimization, since writing to the file shouldn't be the bottleneck here.
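A side benefit (illustrated here with made-up values) is that CSV escapes fields for you, which the hand-rolled join does not:

require 'csv'

# CSV quotes any field that contains a comma, quote, or newline
CSV.generate_line(["jane@example.com", "Doe, Jane", "127.0.0.1"])
# => "jane@example.com,\"Doe, Jane\",127.0.0.1\n"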


4 Comments

Thanks.. what about writing to a csv file? Will the csv gem optimize writing to the file?
@Aljaz Not really; I've used CSV just to avoid joining fields/lines. csv is not a gem, it comes from the Ruby stdlib.
@Aljaz The CSV module will, however, ensure that all of your values are escaped correctly. If there's any chance any of the values from your database will have a comma or newline in them (i.e. if you accept user input and don't have strict validations to reject those characters), you should use the CSV module instead of doing this "manually." Honestly, 100,000 rows is not very many, and the CSV module (which, since 1.9.3, is based on FasterCSV) will do this very quickly.
find_each didn't help in my case. With 700k records it used 1GB of memory for some reason.
