
Using Ruby 1.9.3, I want to read in a CSV file with headers and check each individual field to see whether it was left empty and contains no value, as in foo,,bar,foofoo,barbar (the second field).

My approach is as follows:

require 'csv'

# read the csv file line by line
CSV.foreach(filename, headers: true) do |row|

  # loop through each field within the current row
  for i in (0..row.length - 1)

    # check for empty fields
    if !row[i]
      puts "empty field"
    end
  end
end

Well, this works, but when processing a file with ~18 million fields, this is quite slow, and I have many of them. Is there any faster and more elegant ways to do this?

  • It's odd to see for being used in Ruby code. Most of the time people do row.each do |column| instead. Can you define "quite slow"? This could be a 90GB file you're processing here. You haven't given any context. Commented Mar 31, 2014 at 16:07
  • Uhm that's true, looks strange with for... does it make a big difference though? I changed it to row.each do |column| and if column[1] and now it takes about 10 secs per CSV file (30-40 MB, ~32000 rows, 577 columns). Maybe there was something else messed up. I can live with that for my purposes, however, if somebody knows something faster than this, I still appreciate. Commented Mar 31, 2014 at 16:42
  • It should be if column at that point since it's expanding row into a series of independent column entries. column[1] refers to the 2nd character of the column string. As for speed, CSV decoding isn't always blazingly fast, especially on larger files. The CSV module in 1.9.3 is better, but you might want to try Ruby 2.1 and see if that's even faster, which it should be. Commented Mar 31, 2014 at 17:21
  • A file with ~18 million fields? Fields? Do you mean "rows" or "lines" instead? Good luck trying to read a single row with ~18 million fields, even if they're empty. Commented Mar 31, 2014 at 18:22
  • In headers: true mode it might need to be declared as row.each do |header, column| to extract those values. Commented Apr 1, 2014 at 14:31
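For what it's worth, the point in the last comment can be seen in a tiny standalone sketch: with headers: true, each row is a CSV::Row whose each yields header/value pairs, so an empty field shows up as a nil value.

```ruby
require 'csv'

# Minimal illustration with inline data; an empty field between two
# commas is parsed as nil.
data = "a,b,c\nfoo,,bar\n"
CSV.parse(data, headers: true).each do |row|
  row.each do |header, value|
    puts "#{header} is empty" if value.nil?
  end
end
# prints "b is empty"
```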

3 Answers


Using grep

Edit: Having my big file around, I also tested Uri Agassi's approach of using grep to get the lines of the file with empty fields:

File.new(filename).grep(/(^,|,(,|$))/)

It's about 10 times faster. If you need access to the fields you can use CSV.parse:

require 'csv'

File.new("/tmp/big.csv").grep(/(^,|,(,|$))/).each do |row_string|
  CSV.parse(row_string) do |row|
    puts row[1]
  end
end

Using a native CSV parser

Otherwise, if you have to parse the whole CSV file anyway, the answer is most likely no. Try running your script without the checking part, just reading the CSV rows: you will see virtually no change in running time, because most of it is spent reading and parsing the CSV file.
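One way to verify this is to time the two variants with Ruby's Benchmark module; a sketch, using a small generated file in place of the real one:

```ruby
require 'csv'
require 'benchmark'
require 'tempfile'

# A generated sample stands in for the real file here; point `path`
# at your actual CSV instead.
sample = Tempfile.new(['big', '.csv'])
sample.write("a,b,c\n" + "foo,,bar\n" * 5_000)
sample.close
path = sample.path

# Just parse, no check.
parse_only = Benchmark.realtime do
  CSV.foreach(path, headers: true) { |_row| }
end

# Parse and check every field for nil.
with_check = Benchmark.realtime do
  CSV.foreach(path, headers: true) do |row|
    row.each { |_header, value| value.nil? }
  end
end

puts format("parse only: %.3fs, with check: %.3fs", parse_only, with_check)
```

The two timings should come out close to each other, which is the point: the nil check costs little compared to the parsing.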

You might wonder if there is a faster CSV library for Ruby. There was indeed a gem called FasterCSV, but Ruby 1.9 adopted it as its built-in CSV library, so it probably won't get much faster using Ruby alone.

There is a Ruby gem named excelsior which uses a native CSV parser. You can install it via gem install excelsior and use it like this:

require 'excelsior'

Excelsior::Reader.rows(File.open('/tmp/big.csv')) do |row|
  row.each do |column|
    unless column
      puts "empty field"
    end
  end
end

I tested this code with a file like yours (72 MB, ~30k rows of ~2.5k fields each) and it is about twice as fast; however, it segfaulted after a few lines, so the gem might not be stable.

Using CSV

As you mentioned in your comment, there are a few more idiomatic ways to write this, such as using each instead of the for loop, unless instead of if !, and two spaces for indentation, which turns it into:

require 'csv'

CSV.foreach('/tmp/big.csv') do |row|
  row.each do |column|
    unless column
      puts "empty field"
    end
  end
end

This won't improve the speed though.


1 Comment

Hell yeah, the grep approach is really fast. That's enough for my purpose, thanks for the informative answer!

Parsing the CSVs could take a lot of your CPU. If all you want is to find the lines which contain an empty field (i.e. contain two consecutive commas, start with a comma, or end with one), you can use grep on the raw lines of the files, without actually parsing them:

File.new(filename).grep(/(^,|,(,|$))/)
# => all the lines which have an empty field

I'm afraid you would still have to read the files in full, so it might not be as fast as you hope, but unless there is some index on the files I can't see a way around it.
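A quick standalone check of that pattern against a few hand-written lines (with the caveat that quoted fields containing commas would still need real CSV parsing, since this only looks at raw text):

```ruby
pattern = /(^,|,(,|$))/

# Each sample line paired with whether it should match: a match means
# the line has at least one empty field.
{
  "foo,,bar" => true,   # empty middle field
  ",foo,bar" => true,   # empty first field
  "foo,bar," => true,   # empty last field
  "foo,bar"  => false,  # no empty fields
}.each do |line, expected|
  raise "unexpected result for #{line.inspect}" unless !!(line =~ pattern) == expected
end
puts "pattern matches only lines with empty fields"
```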



You can check all fields at once using Enumerable#any?:

CSV.foreach(filename, headers: true) do |row|
  # row.fields returns just the values; with headers: true, iterating
  # the row itself yields header/field pairs, which are never nil
  puts "empty field" if row.fields.any?(&:nil?)
end

I think the grep solution will still be faster; shelling out to the Linux grep command would be the fastest.
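A sketch of that shell-out (assuming a Unix-like system with grep on the PATH; the sample file and its contents are made up for illustration):

```ruby
require 'open3'
require 'tempfile'

# A small sample file so the sketch runs as-is; substitute the path
# of your real CSV.
csv = Tempfile.new(['sample', '.csv'])
csv.write("foo,,bar\nfoo,bar,baz\n,first,empty\n")
csv.close

# -E enables extended regular expressions; stdout holds the raw
# matching lines, ready for CSV.parse if you need the fields.
matches, _status = Open3.capture2('grep', '-E', '(^,|,(,|$))', csv.path)
puts matches
# prints "foo,,bar" and ",first,empty"
```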

2 Comments

Do you mean Enumerable#any?? all? would take longer than any?.
@theTinMan Yes, it was a typo; I had any? in the code. Thanks for pointing it out.
