Ruby: Issues parsing a CSV and looping through rows

Question

I have a CSV with a number of filenames and dates:

"doc_1.doc", "date1"
"doc_2.doc", "date2"
"doc_5.doc", "date5"

The issue is that there are many gaps between file numbers, e.g. between doc_2 and doc_5.

I am trying to write a script that parses the CSV, compares each row with the previous one, and fills in any missing rows.

In this example, it would add:

"doc_3.doc", "date copied from date2"
"doc_4.doc", "date copied from date2"

I'm writing this script in Ruby because I'm trying to learn the language. Clearly I am misunderstanding how Ruby's looping works, because it's not the typical 'for' loop one often uses in PHP etc.

Here is my code so far. Any help with the loop itself would be greatly appreciated!

#!/usr/bin/env ruby

require 'csv'

# Load file
csv_fname = './upload-list-docs.csv'

# Parsing function
def parse_csv(csv_fname)
    uploads = []
    last_number = 0

    # Regex to find number in doc_XXX.YYY
    regex_find_number = /(?<=\_)(.*?)(?=\.)/

    csv_content = CSV.read(csv_fname)

    # Skip header row
    csv_content.shift

    csv_content.each do |row|
        current_number = row[0].match regex_find_number
        current_date = row[1]
        last_date = current_date

        until last_number == current_number do
            uploads << [last_number, last_date]
            last_number += 1
        end
    end

    return uploads
end

puts parse_csv(csv_fname)

And some sample CSV

"file_name","date"
"doc_1.jpg","2011-05-11 09:16:05.000000000"
"doc_3.doc","2011-05-11 10:10:36.000000000"
"doc_4.doc","2011-05-11 10:17:19.000000000"
"doc_6.doc","2011-05-11 10:58:35.000000000"
"doc_7.pdf","2011-05-11 11:16:22.000000000"
"doc_8.pdf","2011-05-11 11:19:29.000000000"
"doc_9.docx","2011-05-11 11:40:03.000000000"
"doc_13.pdf","2011-05-11 12:26:32.000000000"
"doc_14.docx","2011-05-11 12:34:50.000000000"
"doc_15.doc","2011-05-11 12:40:12.000000000"
"doc_16.doc","2011-05-11 13:03:11.000000000"
"doc_17.doc","2011-05-11 13:03:58.000000000"
"doc_19.pdf","2011-05-11 13:25:07.000000000"
"doc_20.rtf","2011-05-11 13:34:26.000000000"
"doc_21.rtf","2011-05-11 13:35:25.000000000"
"doc_24.doc","2011-05-11 13:49:02.000000000"
"doc_25.doc","2011-05-11 14:05:04.000000000"
"doc_26.pdf","2011-05-11 14:18:26.000000000"
"doc_27.rtf","2011-05-11 14:30:19.000000000"
"doc_28.doc","2011-05-11 14:33:13.000000000"
"doc_29.jpg","2011-05-11 15:07:27.000000000"
"doc_30.doc","2011-05-11 15:22:30.000000000"
"doc_31.doc","2011-05-11 15:31:07.000000000"
"doc_34.doc","2011-05-11 15:51:56.000000000"
"doc_35.doc","2011-05-11 15:55:15.000000000"
"doc_36.doc","2011-05-11 16:06:46.000000000"
"doc_38.wps","2011-05-11 16:21:08.000000000"
"doc_39.doc","2011-05-11 16:30:57.000000000"
"doc_40.doc","2011-05-11 16:41:55.000000000"
"doc_43.JPG","2011-05-11 17:03:40.000000000"
"doc_46.doc","2011-05-11 17:28:13.000000000"
"doc_51.doc","2011-05-11 17:50:34.000000000"
"doc_52.doc","2011-05-11 18:03:13.000000000"
"doc_53.doc","2011-05-11 18:43:48.000000000"
"doc_54.doc","2011-05-11 18:54:45.000000000"
"doc_55.doc","2011-05-11 19:31:03.000000000"
"doc_56.doc","2011-05-11 19:31:23.000000000"
"doc_57.doc","2011-05-11 20:17:38.000000000"
"doc_59.jpg","2011-05-11 20:22:55.000000000"
"doc_61.pdf","2011-05-11 21:14:52.000000000"

The first thing to do: disable the endless inner loop by changing it to until last_number >= current_number do ... That will give you a clue.
– Joachim W, Commented Nov 20, 2013 at 13:13
@waffl current_number doesn't have to change since last_number is changing. As long as one of them is changing (and getting closer to the termination condition)
– MxLDevs, Commented Nov 20, 2013 at 16:25
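
To see what clue the >= change gives, it may help to look at what String#match actually returns. The following is a minimal sketch using the regex from the question; the helper names raw_match and number_from are made up for illustration and do not appear in the original code.

```ruby
# Regex from the question: whatever sits between "_" and "."
REGEX_FIND_NUMBER = /(?<=\_)(.*?)(?=\.)/

# String#match returns a MatchData object (or nil), not a number.
def raw_match(file_name)
  file_name.match(REGEX_FIND_NUMBER)
end

# Element 0 of the match is the matched text; to_i makes it an Integer.
def number_from(file_name)
  raw_match(file_name)[0].to_i
end

m = raw_match('doc_5.doc')
puts m.class                 # MatchData
puts(5 == m)                 # false: an Integer never equals a MatchData
puts number_from('doc_5.doc') == 5 # true

# With >= the type mismatch surfaces at once instead of looping forever.
begin
  0 >= m
rescue ArgumentError => e
  puts "ArgumentError: #{e.message}"
end
```

Because the == comparison is false no matter how far last_number climbs, the until loop in the question has no way to stop.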

SLD · Accepted Answer · 2013-11-20 19:31:49Z

1

An OO approach. Note that I wrote this when I thought you wanted the blanks filled with [doc_X.doc, date] rather than [X, date]. The OO approach suited that case better because it required more regexes on @file_name. It may be a bit verbose now, but it works and is quite readable.

require 'csv'

class Upload

  attr_reader :file_number, :date

  def initialize(file_name_or_number, date)
    @date = date
    @file_number = if file_name_or_number.is_a?(String)
                     file_name_or_number[/_(\d+)\./, 1].to_i
                   else
                     file_name_or_number
                   end
  end

  def to_a
    [@file_number, @date]
  end
end

class UploadCollection

  attr_reader :uploads

  def initialize(input_file)
    # Drop the header row, keeping every row after the first
    input_data = CSV.read(input_file)[1..-1]
    # Create an array of Upload objects and sort by file number
    @uploads = input_data
                  .map { |row| Upload.new(row[0], row[1]) }
                  .sort_by(&:file_number)
  end

  def fill_blanks!
    # Nothing to fill if the input had no data rows
    return true if @uploads.empty?
    # Get the smallest and largest file number
    # (they're sorted this way, remember)
    min, max = @uploads.first.file_number, @uploads.last.file_number
    # Create an array of all numbers between min and max, and
    # remove those elements already representing a file number
    missing = (min..max).to_a - @uploads.map(&:file_number)
    missing.each do |num|
      # Missing numbers are handled in ascending order, so by the time we
      # reach num every smaller number is present and num belongs at index
      # num - min. The entry just before it supplies the date to copy.
      index = num - min
      @uploads.insert(index, Upload.new(num, @uploads[index - 1].date))
    end

    # Non-ambiguous return value
    true
  end

  def to_a
    @uploads.map(&:to_a)
  end

  def write_csv(file_path)
    CSV.open(file_path, 'wb') do |csv|
      csv << ['file_number', 'date'] # Headers
      to_a.each { |u| csv << u }
    end
  end
end

file = 'fnames.csv'
collection = UploadCollection.new(file)
collection.fill_blanks!
puts collection.to_a
collection.write_csv('out.csv')
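
The index arithmetic in fill_blanks! leans on how Array#insert behaves, so here is a small sketch of that in isolation. The helper insert_missing is hypothetical (it is not part of the answer) and works on bare integers; it offsets from the smallest number, which is an assumption added here so that lists not starting at 1 are covered too.

```ruby
# Array#insert(index, obj) places obj before the element currently at index.
letters = %w[a c d]
letters.insert(1, 'b')
p letters # prints ["a", "b", "c", "d"]

# Hypothetical helper: fill the holes in a sorted list of integers by
# inserting each missing number at its offset from the smallest one.
def insert_missing(sorted_numbers)
  return [] if sorted_numbers.empty?

  min = sorted_numbers.first
  result = sorted_numbers.dup
  missing = (min..sorted_numbers.last).to_a - sorted_numbers
  # Ascending order matters: once every number below num is in place,
  # num belongs at index num - min.
  missing.each { |num| result.insert(num - min, num) }
  result
end

p insert_missing([20, 21, 25, 26]) # prints [20, 21, 22, 23, 24, 25, 26]
```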

edited Nov 20, 2013 at 19:31

answered Nov 20, 2013 at 16:16


4 Comments

waffl Over a year ago

This worked perfectly, I just altered def row [@file_name, @date] end to def row [@file_number, @date] end to get a list of only numbers. My only final question now, if there's any chance you can help, is how to output the array to a CSV file?

waffl Over a year ago

Actually, I've noticed that this only seems to insert one record in the gaps: For example, 20, 21, 25, 26 becomes 20, 21, 22, 25, 26

SLD Over a year ago

You are of course correct. I'm not sure how trivial it will be to fix this.

waffl Over a year ago

I've combined it with the other answer and am very close to the solution. The only issue is that, for example, 7, 8, 12 now becomes 7, 8, 9, 9, 9, 12, which I imagine is due to file_number not updating: pastie.org/8495897
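
The code behind the pastie link is not reproduced here, so the following is not a patch for it. It is a separate minimal sketch of one way to fill a gap of any width, assuming the rows are already [number, date] pairs sorted by number. The method name fill_gaps is made up for illustration. Taking each missing number from a range avoids the repeated 9, 9, 9 described above.

```ruby
# Fill numeric gaps in a sorted list of [number, date] pairs.
# Every missing number gets the date of the last real row before the gap.
def fill_gaps(rows)
  return [] if rows.empty?

  filled = rows.each_cons(2).flat_map do |(number, date), (next_number, _)|
    # The current row, followed by one entry per number missing after it.
    # The range is empty when next_number == number + 1.
    [[number, date]] + ((number + 1)...next_number).map { |missing| [missing, date] }
  end
  # each_cons never emits the final row as a "current" row, so add it here.
  filled << rows.last
end

rows = [[7, 'date7'], [8, 'date8'], [12, 'date12']]
fill_gaps(rows).each { |number, date| puts "#{number},#{date}" }
# prints:
# 7,date7
# 8,date8
# 9,date8
# 10,date8
# 11,date8
# 12,date12
```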

hirolau · Accepted Answer · 2013-11-20 13:35:59Z

1

Here is how I would write the code:

require 'csv'
csv_fname = './upload-list-docs.csv'

# Create a structure to get some easy methods:
Myfile = Struct.new(:name,:date){
  def number
    name[/(?<=\_)(.*?)(?=\.)/].to_i
  end
  def next_file
    Myfile.new(name.gsub(/(?<=\_)(.*?)(?=\.)/){|num|num.next}, date)
  end
}

# Read the content and add it to an array:
content = CSV.read(csv_fname)[1..-1].map{|data| Myfile.new(*data)}

# Add first entry to a result array:
result = [content.shift]

until content.empty?

  # Get new file:
  new_file = content.shift

  # Fill up with new files until we hit next file:
  files_between = new_file.number - result.last.number
  unless files_between == 1
    (files_between - 1).times do
      result << result.last.next_file
    end
  end

  # Add next file:
  result << new_file

end

# Map result back to array:
result.map!(&:to_a)
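
The next_file method works because String#next (an alias of String#succ) increments a string of digits as a number would be incremented, including the carry from 9 to 10. A short sketch, with the gsub call pulled out into a helper named bump_number (a name made up for this example):

```ruby
# Same regex and block as next_file in the answer above.
def bump_number(name)
  name.gsub(/(?<=\_)(.*?)(?=\.)/) { |num| num.next }
end

puts bump_number('doc_8.pdf')   # doc_9.pdf
puts bump_number('doc_9.docx')  # doc_10.docx
puts bump_number('doc_19.pdf')  # doc_20.pdf
```

Note that the extension is carried over unchanged, so a filled-in row reuses the extension of the file before the gap.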

answered Nov 20, 2013 at 13:35


4 Comments

waffl Over a year ago

This works perfectly. Can you explain how I could get the output to be just the number (the result of the regex, without the prefix and suffix)? I'm trying to modify the line in next_file that constructs Myfile.new(name.gsub(/(?<=\_)(.*?)(?=\.)/){|num|num.next}, date), but I'm not having any luck.

hirolau Over a year ago

Not sure what you mean exactly. Maybe instead of string.match(regexp) you should just use string[regexp]. It returns a String instead of a MatchData object. If that is not what you are looking for, you need to rephrase your question.

waffl Over a year ago

Hm, instead of doc_1.jpg, date, I'd like it to just be 1, date

hirolau Over a year ago

Ok, so in my version it would be: Myfile.new(name[/(?<=\_)(.*?)(?=\.)/].next, date). But the problem is that this breaks the regexp for the next iteration, since the name is then a bare number with no underscore or dot left to match. You could change the regexp to capture only digits, but then it might break if the file names contain other digits.
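
One way around the problem described in the last comment is to parse the number once, when the row is read, and store it as an Integer so that no regex is needed afterwards. This is a sketch under that assumption, not hirolau's code. The struct name NumberedFile and the method from_row are hypothetical.

```ruby
# Holds the file number as an Integer instead of the full file name.
NumberedFile = Struct.new(:number, :date) do
  # Build from a CSV row such as ["doc_13.pdf", "2011-05-11 12:26:32.000000000"].
  def self.from_row(row)
    name, date = row
    # Capture group 1: the digits between an underscore and a dot.
    digits = name[/_(\d+)\./, 1]
    raise ArgumentError, "no file number in #{name.inspect}" if digits.nil?

    new(digits.to_i, date)
  end

  # The following number, carrying the same date along.
  def next_file
    NumberedFile.new(number + 1, date)
  end
end

file = NumberedFile.from_row(['doc_13.pdf', '2011-05-11 12:26:32.000000000'])
p file.to_a           # prints [13, "2011-05-11 12:26:32.000000000"]
p file.next_file.to_a # prints [14, "2011-05-11 12:26:32.000000000"]
```

Requiring the digits to sit directly between an underscore and a dot reduces the risk from other digits in the name, but a name with several such groups would still need a stricter pattern.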

Joachim W · Accepted Answer · 2013-11-20 13:22:46Z

0

The problem is not with the loops (aside from the dangerous ==, which should be changed to >= as already said above), but with extracting an integer from a regex match. String#match returns a MatchData object, so take the matched text and convert it:

current_number = row[0].match( regex_find_number )[0].to_i
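
For context, here is a sketch of how the function from the question might look with that conversion in place. Two further changes are assumptions of this sketch rather than part of the answer. First, the date of the previous row is remembered so that gap rows copy it. Second, the rows that really exist are emitted as well. The method takes CSV text instead of a file name so the example is self-contained. The names parse_csv_text and SAMPLE_CSV are made up.

```ruby
require 'csv'

def parse_csv_text(csv_text)
  uploads = []
  last_number = nil
  last_date = nil

  rows = CSV.parse(csv_text)
  rows.shift # skip header row

  rows.each do |name, date|
    # [0] is the matched text; to_i turns "3" into 3
    current_number = name.match(/(?<=\_)(.*?)(?=\.)/)[0].to_i

    # Emit the numbers skipped since the previous row, with its date
    unless last_number.nil?
      next_number = last_number + 1
      while next_number < current_number
        uploads << [next_number, last_date]
        next_number += 1
      end
    end

    uploads << [current_number, date]
    last_number = current_number
    last_date = date
  end

  uploads
end

SAMPLE_CSV = <<~TEXT
  "file_name","date"
  "doc_1.jpg","2011-05-11 09:16:05.000000000"
  "doc_3.doc","2011-05-11 10:10:36.000000000"
  "doc_4.doc","2011-05-11 10:17:19.000000000"
  "doc_6.doc","2011-05-11 10:58:35.000000000"
TEXT

parse_csv_text(SAMPLE_CSV).each { |number, date| puts "#{number},#{date}" }
```

The sketch assumes every file name matches the pattern. A name without an underscore and a dot would make match return nil and raise a NoMethodError.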

edited Nov 20, 2013 at 13:22

answered Nov 20, 2013 at 13:17


1 Comment

waffl Over a year ago

Unfortunately this does not seem to work, as I believe the issue is indeed in the loop. I suppose the each function doesn't iterate the CSV sequentially, which is why the loop is infinite.
