
I have to compare two CSV files which are produced by an e-commerce site. The files are always similar, except that the newer ones contain a different number of items, because the catalogue changes every week.

Example of the CSV file:

sku_code, description, price, url    
001, product one, 100, www.something.com/1 
002, product two, 150, www.something.com/2

By comparing two files extracted on different days, I would like to produce a list of products which have been discontinued and another list of products which have been added.

My index should be the sku_code, which is unique within the catalogue.

I've been using this code from Stack Overflow:

#old file
f1 = IO.readlines("oldfeed.csv").map(&:chomp)
#new file
f2 = IO.readlines("newfeed.csv").map(&:chomp)

#find new products
File.open("new_products.txt","w"){ |f| f.write((f2-f1).join("\n")) }

#find old products
File.open("deleted_products.txt","w"){ |f| f.write((f1-f2).join("\n")) }

My issue

It works well, except in one case: when one of the fields after the sku_code is changed, the product is considered "new" (e.g. a change of price), even though for my needs it's the same product.

What is the smartest way to compare only the sku_code instead of the whole row?
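For example, if only the price of product 001 changes between the two extractions, the line-level set difference reports it in both output lists (sample rows made up, headers omitted):

```ruby
f1 = ["001, product one, 100", "002, product two, 150"]  # old feed
f2 = ["001, product one, 120", "002, product two, 150"]  # new feed, price changed

f2 - f1  # => ["001, product one, 120"]  reported as "new"
f1 - f2  # => ["001, product one, 100"]  reported as "deleted"
```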

2 Comments

  • Is there a reason you're not using the csv gem? Commented Jul 25, 2013 at 14:12
  • @Duck1337 No reason in particular! I've just begun learning Ruby and I'm not familiar with the many gems in existence. Commented Jul 25, 2013 at 23:36

3 Answers


No need to use a CSV library, because you are not interested in the actual values (except the sku_code). I'd put each line into a hash with sku_code as the key, compare the sku_codes, and then retrieve the values from those hashes.

#old file
f1 = IO.readlines("oldfeed.csv").map(&:chomp)
f1_hash = f1[1..-1].inject(Hash.new) {|hash,line| hash[line[/^\d+/]] = line; hash}
#new file
f2 = IO.readlines("newfeed.csv").map(&:chomp)
f2_hash = f2[1..-1].inject(Hash.new) {|hash,line| hash[line[/^\d+/]] = line; hash}

#find new products
new_product_keys = f2_hash.keys - f1_hash.keys
new_products = new_product_keys.map {|sku_code| f2_hash[sku_code] }

#find old products
old_product_keys = f1_hash.keys - f2_hash.keys
old_products = old_product_keys.map {|sku_code| f1_hash[sku_code] }

# write new products to file
File.open("new_products.txt","w") do |f|
  f.write "#{f2.first}\n"
  f.write new_products.join("\n")
end

#write old products to file
File.open("deleted_products.txt","w") do |f|
  f.write "#{f1.first}\n"
  f.write old_products.join("\n")
end

The first line of each CSV file contains only the column names, so I skipped it (f1[1..-1]) and added it back later when writing the output file (f.write "#{f1.first}\n").

Tested it for two imaginary CSV files.


EDIT: Accidentally computed old_products using the new_product_keys, which was a typo. Thanks to those who tried to edit my answer (but were unfortunately rejected).


3 Comments

  • I've tried the code but the products files remain empty, except for the first line with the column names.
  • Possible issues: the delimiter used in the file is the pipe (|) instead of the comma. Another mistake on my part is that the SKU_CODE is alphanumeric rather than simply numeric (e.g. UILED19X11), so the regex \d is probably not matching as it should.
  • Confirmed! I've changed the regex and it works beautifully: f1_hash = f1[1..-1].inject(Hash.new) {|hash,line| hash[line[/^\w+/]] = line; hash}
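Putting the two fixes from these comments together, a sketch that takes the SKU as everything before the first pipe rather than matching a regex — the helper name and sample rows are made up:

```ruby
# Build a { sku => full_line } hash from feed lines (header already removed).
# Splitting on the delimiter makes no assumptions about the SKU format.
def sku_hash(lines, delimiter = "|")
  lines.each_with_object({}) do |line, hash|
    sku = line.split(delimiter).first.strip
    hash[sku] = line
  end
end

old_lines = ["UILED19X11|lamp|100|www.something.com/1",
             "ABC123|chair|50|www.something.com/2"]
new_lines = ["UILED19X11|lamp|120|www.something.com/1",
             "XYZ900|table|80|www.something.com/3"]

old_hash = sku_hash(old_lines)
new_hash = sku_hash(new_lines)

added   = (new_hash.keys - old_hash.keys).map { |k| new_hash[k] }
removed = (old_hash.keys - new_hash.keys).map { |k| old_hash[k] }
# added   => ["XYZ900|table|80|www.something.com/3"]
# removed => ["ABC123|chair|50|www.something.com/2"]
```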
require 'csv'

# Paths to the two feed files
DOA = 'oldfeed.csv'
DOB = 'newfeed.csv'

# This is a CSV file that will hold the unique values.
# You don't need to create this file; Ruby will make it for you.
DOC = 'finished_product.csv'

# CSV.read puts each file into an array of row arrays.
holder_1 = CSV.read(DOA)
holder_2 = CSV.read(DOB)

# Assuming the sku_code is the first field of each row
# (holder_1[0] is the header row):
# holder_1[1][0]  #=> "001"
# holder_1[2][0]  #=> "002"

This should get you moving; you need two while loops and an if statement. Do you need more info, or are you okay with this?

If you want a CSV file to show your results, it's easier to use the csv gem.
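A sketch of how the comparison might continue with the csv gem, keying on column 0 of each row; the inline sample data here stands in for the two feed files (CSV.read on the file paths returns the same array of rows):

```ruby
require 'csv'

# CSV.parse works on a string; CSV.read('oldfeed.csv') returns the
# same array-of-rows for a file on disk. Sample data is made up.
old_rows = CSV.parse("sku_code,description,price\n001,product one,100\n002,product two,150\n")
new_rows = CSV.parse("sku_code,description,price\n002,product two,160\n003,product three,90\n")

old_skus = old_rows.drop(1).map { |row| row[0] }  # drop(1) skips the header row
new_skus = new_rows.drop(1).map { |row| row[0] }

added_skus   = new_skus - old_skus  # => ["003"]
removed_skus = old_skus - new_skus  # => ["001"]
```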



Assuming that you don't have a big performance concern, I think you want to strive for the least amount of code. Even if performance is an issue, I'd start with the simplest approach and refine from there based on your needs.

I think using the CSV gem is a fine idea, because it's one less thing you have to write code for. That said, here is another way to approach this problem. Note that the diff function below works on either an array or a hash and is independent of how the key is defined. It uses an array internally for the key lookup, but changing that to use a hash is straightforward.

l1a = "001, product one, 100, www.something.com/1"
l2 = "002, product two, 150, www.something.com/2"
l1b = "001, product one, 120, www.something.com/1"
l3 = "003, product three, 100, www.something.com/1"
l4 = "004, product four, 100, www.something.com/1"

file_old = [l1a, l2, l3]
file_new = [l1b, l2, l4]

sku = -> (record) do
  record.split(',')[0]
end

def diff(set1, set2, keyproc)
  set2_keys = set2.collect {|e| keyproc.call(e)}
  set1.reject {|e| set2_keys.include?(keyproc.call(e))}
end

puts diff(file_old, file_new, sku)
# prints "003, product three, 100, www.something.com/1"
puts diff(file_new, file_old, sku)
# prints "004, product four, 100, www.something.com/1"
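The hash-based lookup mentioned above might look like this — a sketch using a Set for O(1) membership tests instead of scanning an array; diff_fast and the shortened sample rows are illustrative:

```ruby
require 'set'

# Same contract as diff, but builds a Set of keys once, so each
# membership test is O(1) instead of a linear scan of set2's keys.
def diff_fast(set1, set2, keyproc)
  set2_keys = set2.map { |e| keyproc.call(e) }.to_set
  set1.reject { |e| set2_keys.include?(keyproc.call(e)) }
end

sku = ->(record) { record.split(',')[0] }

file_old = ["001, product one, 100", "002, product two, 150", "003, product three, 100"]
file_new = ["001, product one, 120", "002, product two, 150", "004, product four, 100"]

diff_fast(file_old, file_new, sku)  # => ["003, product three, 100"]
diff_fast(file_new, file_old, sku)  # => ["004, product four, 100"]
```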

