
I have a CSV with a number of filenames and dates:

"doc_1.doc", "date1"
"doc_2.doc", "date2"
"doc_5.doc", "date5"

The issue is that there are many gaps in the file numbers, e.g. between doc_2 and doc_5.

I am trying to write a script that parses the CSV, compares each row with the next, and fills in any missing file numbers.

e.g. in this example, it would add

"doc_3.doc", "date copied from date2"
"doc_4.doc", "date copied from date2"

I'm writing this script in Ruby because I'm trying to learn the language. I'm clearly misunderstanding how Ruby's looping works, since it's not the typical 'for' loop one often uses in PHP and similar languages.

Here is my code so far. Any help with the loop itself would be greatly appreciated!

#!/usr/bin/env ruby

require 'csv'

# Load file
csv_fname = './upload-list-docs.csv'

# Parsing function
def parse_csv(csv_fname)
    uploads = []
    last_number = 0

    # Regex to find number in doc_XXX.YYY
    regex_find_number = /(?<=\_)(.*?)(?=\.)/

    csv_content = CSV.read(csv_fname)

    # Skip header row
    csv_content.shift

    csv_content.each do |row|
        current_number = row[0].match regex_find_number
        current_date = row[1]
        last_date = current_date

        until last_number == current_number do
            uploads << [last_number, last_date]
            last_number += 1
        end
    end

    return uploads
end

puts parse_csv(csv_fname)

And some sample CSV:

"file_name","date"
"doc_1.jpg","2011-05-11 09:16:05.000000000"
"doc_3.doc","2011-05-11 10:10:36.000000000"
"doc_4.doc","2011-05-11 10:17:19.000000000"
"doc_6.doc","2011-05-11 10:58:35.000000000"
"doc_7.pdf","2011-05-11 11:16:22.000000000"
"doc_8.pdf","2011-05-11 11:19:29.000000000"
"doc_9.docx","2011-05-11 11:40:03.000000000"
"doc_13.pdf","2011-05-11 12:26:32.000000000"
"doc_14.docx","2011-05-11 12:34:50.000000000"
"doc_15.doc","2011-05-11 12:40:12.000000000"
"doc_16.doc","2011-05-11 13:03:11.000000000"
"doc_17.doc","2011-05-11 13:03:58.000000000"
"doc_19.pdf","2011-05-11 13:25:07.000000000"
"doc_20.rtf","2011-05-11 13:34:26.000000000"
"doc_21.rtf","2011-05-11 13:35:25.000000000"
"doc_24.doc","2011-05-11 13:49:02.000000000"
"doc_25.doc","2011-05-11 14:05:04.000000000"
"doc_26.pdf","2011-05-11 14:18:26.000000000"
"doc_27.rtf","2011-05-11 14:30:19.000000000"
"doc_28.doc","2011-05-11 14:33:13.000000000"
"doc_29.jpg","2011-05-11 15:07:27.000000000"
"doc_30.doc","2011-05-11 15:22:30.000000000"
"doc_31.doc","2011-05-11 15:31:07.000000000"
"doc_34.doc","2011-05-11 15:51:56.000000000"
"doc_35.doc","2011-05-11 15:55:15.000000000"
"doc_36.doc","2011-05-11 16:06:46.000000000"
"doc_38.wps","2011-05-11 16:21:08.000000000"
"doc_39.doc","2011-05-11 16:30:57.000000000"
"doc_40.doc","2011-05-11 16:41:55.000000000"
"doc_43.JPG","2011-05-11 17:03:40.000000000"
"doc_46.doc","2011-05-11 17:28:13.000000000"
"doc_51.doc","2011-05-11 17:50:34.000000000"
"doc_52.doc","2011-05-11 18:03:13.000000000"
"doc_53.doc","2011-05-11 18:43:48.000000000"
"doc_54.doc","2011-05-11 18:54:45.000000000"
"doc_55.doc","2011-05-11 19:31:03.000000000"
"doc_56.doc","2011-05-11 19:31:23.000000000"
"doc_57.doc","2011-05-11 20:17:38.000000000"
"doc_59.jpg","2011-05-11 20:22:55.000000000"
"doc_61.pdf","2011-05-11 21:14:52.000000000"
5 Comments
  • What happens when you run this code? Commented Nov 20, 2013 at 13:08
  • You get an endless loop, right? Commented Nov 20, 2013 at 13:11
  • Yes, endless loop because 'current_number' never changes. Commented Nov 20, 2013 at 13:11
  • The first thing to do: disable the endless inner loop by changing into until last_number >= current_number do ... that will give you a clue. Commented Nov 20, 2013 at 13:13
  • @waffl current_number doesn't have to change since last_number is changing. As long as one of them is changing (and getting closer to the termination condition) Commented Nov 20, 2013 at 16:25

3 Answers

1

An OO approach. Note that I wrote this when I thought you wanted the gaps filled with [doc_X.doc, date] rather than [X, date]. The OO approach suited that case better, since it required more regexes on @file_name. It may be a bit verbose now, but it works and is quite readable.

require 'csv'

class Upload

  attr_reader :file_number, :date

  def initialize(file_name_or_number, date)
    @date = date
    @file_number = if file_name_or_number.is_a?(String)
                     file_name_or_number[/_(\d+)\./, 1].to_i
                   else
                     file_name_or_number
                   end
  end

  def to_a
    [@file_number, @date]
  end
end

class UploadCollection

  attr_reader :uploads

  def initialize(input_file)
    # Drop the header row (keep everything after the first element)
    input_data = CSV.read(input_file)[1..-1]
    # Create an array of Upload objects and sort by file number
    @uploads = input_data
                  .map { |row| Upload.new(row[0], row[1]) }
                  .sort_by(&:file_number)
  end

  def fill_blanks!
    # Get the smallest and largest file number
    # (they're sorted this way, remember)
    min, max = @uploads.first.file_number, @uploads.last.file_number
    # Create an array of all numbers between min and max, and
    # remove those elements already representing a file number
    missing = (min..max).to_a - @uploads.map(&:file_number)
    missing.each do |num|
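      # Missing numbers are processed in ascending order, so every number
      # below num is already present: num belongs at index (num - min) and
      # the entry just before it supplies the date to copy.
      index = num - min
      @uploads.insert(index, Upload.new(num, @uploads[index - 1].date))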
    end

    # Non-ambiguous return value
    true
  end

  def to_a
    @uploads.map(&:to_a)
  end

  def write_csv(file_path)
    CSV.open(file_path, 'wb') do |csv|
      csv << ['file_number', 'date'] # Headers
      to_a.each { |u| csv << u }
    end
  end
end

file = 'fnames.csv'
collection = UploadCollection.new(file)
collection.fill_blanks!
puts collection.to_a
collection.write_csv('out.csv')
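
To see why inserting at an offset from the smallest number puts each missing entry in the right place, here is a minimal sketch on plain [number, date] pairs instead of Upload objects. The fill_gaps helper and the rows data are hypothetical names made up for this illustration, and the sketch assumes the rows are already sorted by number with no duplicates.

```ruby
# Hypothetical helper: rows are [number, date] pairs, sorted by number.
def fill_gaps(rows)
  filled = rows.map(&:dup)
  min = filled.first[0]
  max = filled.last[0]
  # Numbers in the range that have no row yet, in ascending order.
  missing = (min..max).to_a - filled.map(&:first)

  missing.each do |num|
    # Every number below num is already in the array, so num belongs
    # at index (num - min) and its predecessor sits just before it.
    index = num - min
    filled.insert(index, [num, filled[index - 1][1]])
  end

  filled
end

rows = [[2, 'date2'], [5, 'date5'], [6, 'date6']]
result = fill_gaps(rows)
p result
# => [[2, "date2"], [3, "date2"], [4, "date2"], [5, "date5"], [6, "date6"]]
```

Because the missing numbers are handled smallest first, the array stays contiguous up to the insertion point, which is what makes the index arithmetic safe.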

4 Comments

This worked perfectly, I just altered def row [@file_name, @date] end to def row [@file_number, @date] end to get a list of only numbers. My only final question now, if there's any chance you can help, is how to output the array to a CSV file?
Actually, I've noticed that this only seems to insert one record in the gaps: For example, 20, 21, 25, 26 becomes 20, 21, 22, 25, 26
You are of course correct. I'm not sure how trivial it will be to fix this.
I've combined it with the other answer and am very close to the solution, the only issue is that I am now getting, for example from: 7, 8, 12 to: 7, 8, 9, 9, 9, 12 which I imagine is due to file_number not updating: pastie.org/8495897
1

Here is how I would write the code:

require 'csv'
csv_fname = './upload-list-docs.csv'

# Create a structure to get some easy methods:
Myfile = Struct.new(:name,:date){
  def number
    name[/(?<=\_)(.*?)(?=\.)/].to_i
  end
  def next_file
    Myfile.new(name.gsub(/(?<=\_)(.*?)(?=\.)/){|num|num.next}, date)
  end
}

# Read the content and add it to an array:
content = CSV.read(csv_fname)[1..-1].map{|data| Myfile.new(*data)}

# Add first entry to a result array:
result = [content.shift]

until content.empty?

 # Get new file:
 new_file = content.shift

 # Fill up with new files until we hit next file:
 files_between = new_file.number - result.last.number
 unless files_between == 1
   (files_between - 1).times do
     result << result.last.next_file
   end
 end

 # Add next file:
 result << new_file

end

# Map result back to array:
result.map!(&:to_a)

4 Comments

This works perfectly. Can you explain how I could get the output to be just the number (the result of the regex, without the prefix and suffix)? I'm trying to modify the line for the next_file construct Myfile.new(name.gsub(/(?<=\_)(.*?)(?=\.)/){|num|num.next}, date) but not having any luck.
Not sure what you mean exactly. Maybe instead of string.match(regexp) you should just use string[regexp]. It will return a string instead of a match object. If that is not what you are looking for, you need to rephrase your question.
Hm, instead of doc_1.jpg, date, I'd like it to just be 1, date
OK, so in my version it would be: Myfile.new(name[/(?<=\_)(.*?)(?=\.)/].next, date). But the problem is that this breaks the regexp for the next iteration. You could change it to capture only digits, but then it might break if the file names contain other digits.
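
One way to get 1, date style output without breaking the regexp between iterations is to convert each name to an Integer once, up front, and fill the gaps on the numbers alone. The following is only a sketch of that idea, not the answer's Struct-based method: it swaps in Enumerable#each_cons, uses three rows from the question's sample CSV inlined as a string, and assumes the number is the run of digits between the underscore and the dot. The variable names csv_text, rows and numbered are illustrative.

```ruby
require 'csv'

# Sample rows from the question, inlined so no file is needed.
csv_text = <<~DATA
  "file_name","date"
  "doc_8.pdf","2011-05-11 11:19:29.000000000"
  "doc_9.docx","2011-05-11 11:40:03.000000000"
  "doc_13.pdf","2011-05-11 12:26:32.000000000"
DATA

rows = CSV.parse(csv_text, headers: true).map do |row|
  # Digits between the underscore and the dot, as an Integer.
  [row['file_name'][/_(\d+)\./, 1].to_i, row['date']]
end

numbered = [rows.first]
rows.each_cons(2) do |(prev_num, prev_date), (next_num, next_date)|
  # Numbers strictly between two neighbours reuse the earlier date.
  ((prev_num + 1)...next_num).each { |n| numbered << [n, prev_date] }
  numbered << [next_num, next_date]
end

numbered.each { |num, date| puts "#{num}, #{date}" }
```

Since the file name is only parsed once per row, there is no regexp to keep valid across iterations, and each filled number appears exactly once.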
0

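The problem is not with the loops (aside from the dangerous ==, which should be changed to >= as already said above), but with extracting an integer from the regex match. String#match returns a MatchData object, which never equals an Integer, so the until condition is never met.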

current_number = row[0].match( regex_find_number )[0].to_i
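
The difference is easy to check in isolation. This is a small sketch using the regex from the question on a single made-up file name; match, number, last_number and filled are illustrative variable names, and the starting value of 3 is arbitrary.

```ruby
regex_find_number = /(?<=_)(.*?)(?=\.)/

match = 'doc_5.doc'.match(regex_find_number)
number = match[0].to_i

p match.class   # MatchData
p match == 5    # false: a MatchData never equals an Integer
p number == 5   # true

# With an Integer on the right-hand side the loop terminates.
# Using >= also guards against last_number starting above number.
last_number = 3
filled = []
until last_number >= number
  filled << last_number
  last_number += 1
end
p filled        # [3, 4]
```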

1 Comment

Unfortunately this does not seem to work, as I believe the issue is indeed in the loop. The each function doesn't iterate the CSV sequentially, I suppose, and is why the loop is infinite.
