Conditional array or hash combining

Question

I am working on a project that has 2 separate input files, each with some information that relates to the other file.

I have parsed each file and loaded it into its own array, like so:

file_1 << "#{contract_id}|#{region}|#{category}|#{supplier_ID}"
file_2 << "#{contract_id}|#{region}|#{category}|#{manufacturer}|#{model}"

File 1 has 30,000 lines and File 2 has 400,000 lines. My desired output will have somewhere in the neighborhood of 600,000 lines from my estimations.

Now my problem is figuring out a way to combine them, as they have a many-to-many relationship. Every time the contract_id, region AND category match, I need a record that looks like the following:

supplier_ID region category manufacturer model.

My initial thought was to iterate over one of the arrays and put everything into a hash, using #{contract_id}|#{region}|#{category}|#{manufacturer} as the key and #{model} as the value. The limitation there is that it only iterates over the array once, so the output is limited to the number of elements in that array.

Why are you composing this horribly ugly pipe-delimited string when you could just add values as an array?
– tadman, Commented Sep 24, 2014 at 17:04

David Grayson · Accepted Answer · 2014-09-25 05:39:59Z

My understanding of your question:

File 1 has the columns contract_id, region, category, supplier_id.

File 2 has the columns contract_id, region, category, manufacturer, model.

You want a program that takes file 1 and file 2 as inputs and does the equivalent of an SQL join to produce a new file with the following columns: supplier_id, region, category, manufacturer, model. Your join condition is that the contract_id, region, and category need to match.

Here is how I would tackle this:

Step 1: Read both files into arrays that have the data from each. Don't store the data entries as an ugly pipe-delimited string; store them as an array or a hash.

file_1_entries << [contract_id, region, category, supplier_ID]
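A minimal sketch of Step 1, assuming the raw lines are pipe-delimited like the strings in the question. The helper name `parse_entries` and the sample lines are hypothetical and only there for illustration:

```ruby
# Hypothetical helper: turn pipe-delimited lines into arrays of fields.
# Assumes no field itself contains a "|" character.
def parse_entries(lines)
  # The -1 limit keeps trailing empty fields instead of dropping them.
  lines.map { |line| line.chomp.split('|', -1) }
end

# Made-up sample data in the question's file 1 layout:
# contract_id|region|category|supplier_ID
file_1_lines = [
  "C100|EU|laptops|S1\n",
  "C100|EU|laptops|S2\n"
]

file_1_entries = parse_entries(file_1_lines)
# Each entry is now an array such as ["C100", "EU", "laptops", "S1"].
```

Storing arrays rather than joined strings means later steps can pick out columns by position without splitting the string again.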

Step 2: Iterate over the data from both files and make hashes to index them by the columns you care about (contract_id, region, and category). For example, to index file 1, you would make a hash whose key is some combination of those three columns (either an array or a string) and the value is an array of entries from file 1 that match.

file_1_index = {}
file_1_entries.each do |x|
  # Key on the join columns: contract_id, region, category.
  key = x.first(3)
  file_1_index[key] ||= []
  file_1_index[key] << x
end
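The same index can probably be built more compactly with Ruby's built-in `Enumerable#group_by`. This is a sketch, assuming each entry is an array whose first three elements are contract_id, region, and category; the sample entries are made up:

```ruby
# Made-up entries in the layout [contract_id, region, category, supplier_id].
file_1_entries = [
  ["C100", "EU", "laptops", "S1"],
  ["C100", "EU", "laptops", "S2"],
  ["C200", "US", "phones",  "S3"]
]

# Group entries by their first three columns; the array itself is the hash key.
file_1_index = file_1_entries.group_by { |entry| entry.first(3) }

file_1_index.each do |key, entries|
  puts "#{key.inspect} => #{entries.length} entries"
end
```

Arrays work as hash keys here because Ruby hashes them by content, so two entries with the same three column values land in the same bucket.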

Step 3: Iterate over one of your index hashes, and use the index hashes to do the join you want to do.

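joined = []
file_1_index.each do |key, file_1_matching_entries|
  file_2_matching_entries = file_2_index.fetch(key, [])
  # Nested loop over both matches (entries are arrays in the column order above).
  file_1_matching_entries.each do |(_, region, category, supplier_id)|
    file_2_matching_entries.each do |(_, _, _, manufacturer, model)|
      joined << [supplier_id, region, category, manufacturer, model]
    end
  end
end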
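To see the many-to-many behaviour on a small input, here is a self-contained sketch of the three steps together. The helper names `build_index` and `join_entries` and all sample values are hypothetical; entries are assumed to be arrays in the column order listed earlier:

```ruby
# Hypothetical helper: index entries by [contract_id, region, category].
def build_index(entries)
  entries.group_by { |entry| entry.first(3) }
end

# Hypothetical helper: join both entry lists on the three shared columns.
def join_entries(file_1_entries, file_2_entries)
  file_1_index = build_index(file_1_entries)
  file_2_index = build_index(file_2_entries)
  joined = []
  file_1_index.each do |key, file_1_matches|
    file_2_matches = file_2_index.fetch(key, [])
    file_1_matches.each do |(_, region, category, supplier_id)|
      file_2_matches.each do |(_, _, _, manufacturer, model)|
        joined << [supplier_id, region, category, manufacturer, model]
      end
    end
  end
  joined
end

# Made-up sample data: two suppliers and two models share one key.
file_1_entries = [
  ["C100", "EU", "laptops", "S1"],
  ["C100", "EU", "laptops", "S2"],
  ["C200", "US", "phones",  "S3"] # no match in file 2, so it is dropped
]
file_2_entries = [
  ["C100", "EU", "laptops", "Acme", "A1"],
  ["C100", "EU", "laptops", "Acme", "A2"]
]

rows = join_entries(file_1_entries, file_2_entries)
rows.each { |row| puts row.join("\t") }
```

Two file 1 entries and two file 2 entries under the same key produce four output rows, which is why the joined output can be larger than either input.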

I can't go into very much detail on each of these steps because you asked a pretty broad question and it would take a long time to add all the details. But you should try to do these steps and ask more specific questions if you get stuck.

It's possible your machine might run out of memory while you are doing this, depending on your computer. In that case, you might need to build a temporary database (e.g. with sqlite) and then perform the join using an actual SQL query instead of trying to do it yourself in Ruby.
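One lighter-weight option to try before reaching for a database (this is an alternative, not the SQLite approach mentioned above): since file 1 is the smaller input, index only file 1 and stream file 2 one line at a time. A sketch, assuming pipe-delimited input in the column orders listed earlier; `stream_join` is a hypothetical helper and the `StringIO` objects stand in for real file handles:

```ruby
require 'stringio'

# Hypothetical helper: index the small input fully, then stream the large one.
# Assumes both inputs are pipe-delimited with the column orders listed earlier.
def stream_join(file_1_io, file_2_io, out_io)
  file_1_index = Hash.new { |hash, key| hash[key] = [] }
  file_1_io.each_line do |line|
    contract_id, region, category, supplier_id = line.chomp.split('|', -1)
    file_1_index[[contract_id, region, category]] << supplier_id
  end

  file_2_io.each_line do |line|
    contract_id, region, category, manufacturer, model = line.chomp.split('|', -1)
    # fetch avoids creating empty index entries for unmatched keys.
    file_1_index.fetch([contract_id, region, category], []).each do |supplier_id|
      out_io.puts([supplier_id, region, category, manufacturer, model].join('|'))
    end
  end
end

# StringIO objects stand in for File.open(...) handles in this sketch.
file_1_io = StringIO.new("C100|EU|laptops|S1\nC100|EU|laptops|S2\n")
file_2_io = StringIO.new("C100|EU|laptops|Acme|A1\nC999|EU|laptops|Acme|A9\n")
out_io = StringIO.new

stream_join(file_1_io, file_2_io, out_io)
puts out_io.string
```

Only the file 1 index is held in memory; each file 2 line is written out as soon as it is matched, so neither file 2 nor the joined output needs to fit in memory at once.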

edited Sep 25, 2014 at 5:39

answered Sep 24, 2014 at 17:12

David Grayson


2 Comments

Todd J. Over a year ago

I think you have understood my question perfectly. I will start working on it and update my question if I get stuck. Thanks!

Todd J. Over a year ago

Your answer was just the help I needed. I was able to get everything working yesterday and wanted to say thanks!
