
I am working on a project that has 2 separate input files, each with some information that relates to the other file.

After parsing, I loaded each file into its own array, like so:

file_1 << "#{contract_id}|#{region}|#{category}|#{supplier_ID}"
file_2 << "#{contract_id}|#{region}|#{category}|#{manufacturer}|#{model}"

File 1 has 30,000 lines and File 2 has 400,000 lines. My desired output will have somewhere in the neighborhood of 600,000 lines from my estimations.

Now my problem is figuring out a way to combine them, as they have a many-to-many relationship. Every time the contract_id, region AND category all match, I need a record that looks like the following:

supplier_ID region category manufacturer model.

My initial thought was to iterate over one of the arrays and put everything into a hash, using #{contract_id}|#{region}|#{category}|#{manufacturer} as the key and #{model} as the value. The limitation is that this iterates over the array only once, so the output is capped at the number of elements in that array.

  • I do not understand the question. Commented Sep 24, 2014 at 17:03
  • Why are you composing this horribly ugly pipe-delimited string when you could just add values as an array? Commented Sep 24, 2014 at 17:04

1 Answer

My understanding of your question:

File 1 has the columns contract_id, region, category, supplier_id.

File 2 has the columns contract_id, region, category, manufacturer, model.

You want a program that takes file 1 and file 2 as inputs and does the equivalent of an SQL join to produce a new file with the following columns: supplier_id, region, category, manufacturer, model. Your join condition is that the contract_id, region, and category all need to match.

Here is how I would tackle this:

Step 1: Read both files into arrays that have the data from each. Don't store the data entries as an ugly pipe-delimited string; store them as an array or a hash.

file_1_entries << [contract_id, region, category, supplier_ID]

Step 2: Iterate over the data from both files and make hashes to index them by the columns you care about (contract_id, region, and category). For example, to index file 1, you would make a hash whose key is some combination of those three columns (either an array or a string) and the value is an array of entries from file 1 that match.

file_1_index = {}
file_1_entries.each do |x|
  # the key is the three join columns: [contract_id, region, category]
  key = x.first(3)
  file_1_index[key] ||= []
  file_1_index[key] << x
end
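
As a rough sketch of the indexing step, assuming each entry is an array whose first three elements are contract_id, region, and category (as in step 1), Enumerable#group_by can build the same kind of index in one call. The sample rows here are made up for illustration.

```ruby
# Hypothetical sample rows: [contract_id, region, category, supplier_id]
file_1_entries = [
  ["C1", "East", "Tools", "S1"],
  ["C1", "East", "Tools", "S2"],
  ["C2", "West", "Paint", "S3"],
]

# Group entries by the three join columns.
# Each value in the index is an array of all entries sharing that key.
file_1_index = file_1_entries.group_by { |entry| entry.first(3) }
```

Using an array as the hash key works because Ruby arrays with equal contents hash to the same value, so there is no need to build a delimited string.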

Step 3: Iterate over one of your index hashes, and use the index hashes to do the join you want to do.

file_1_index.keys.each do |key|
  file_1_matching_entries = file_1_index.fetch(key, [])
  file_2_matching_entries = file_2_index.fetch(key, [])
  # nested loop to do the join
end
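
Putting the indexing and join steps together, a minimal sketch might look like the following. It assumes entries are stored as arrays in the column order described above; join_entries is a hypothetical helper name, and the sample rows are invented.

```ruby
# Join two lists of entries on their first three columns
# (contract_id, region, category). Hypothetical helper, not a library method.
def join_entries(file_1_entries, file_2_entries)
  file_1_index = file_1_entries.group_by { |entry| entry.first(3) }
  file_2_index = file_2_entries.group_by { |entry| entry.first(3) }

  output = []
  file_1_index.each do |key, file_1_matching_entries|
    file_2_matching_entries = file_2_index.fetch(key, [])
    # Nested loop: pair every file 1 entry with every file 2 entry for this key.
    file_1_matching_entries.each do |entry_1|
      supplier_id = entry_1[3]
      file_2_matching_entries.each do |entry_2|
        _contract_id, region, category, manufacturer, model = entry_2
        output << [supplier_id, region, category, manufacturer, model]
      end
    end
  end
  output
end

# Made-up sample data for illustration.
file_1_entries = [
  ["C1", "East", "Tools", "S1"],
  ["C1", "East", "Tools", "S2"],
  ["C2", "West", "Paint", "S3"],
]
file_2_entries = [
  ["C1", "East", "Tools", "Acme", "M1"],
  ["C1", "East", "Tools", "Acme", "M2"],
  ["C3", "North", "Glue", "Bolt", "M9"],
]

rows = join_entries(file_1_entries, file_2_entries)
```

With this sample data, the two file 1 entries for C1 each pair with the two file 2 entries for C1, giving four rows. Keys that appear in only one file produce no output, which matches an inner join.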

I can't go into much detail on each of these steps because the question is pretty broad and filling in all the details would take a long time. Try these steps, and ask more specific questions if you get stuck.

Depending on how much memory your machine has, it might run out while doing this. In that case, you might need to build a temporary database (e.g. with sqlite) and then perform the join using an actual SQL query instead of trying to do it yourself in Ruby.
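
Before reaching for a database, one lighter option (a different technique from the steps above, offered only as a sketch) is to index just the smaller file and stream the larger one line by line, writing each joined row out immediately. The example below assumes file 2 is available as pipe-delimited lines in the column order from the question. stream_join is a hypothetical helper name, and StringIO stands in for real File objects.

```ruby
require "stringio"

# Hypothetical helper: index the smaller input in memory, then stream the
# larger input and write joined rows as they are found.
def stream_join(file_1_entries, file_2_io, out_io)
  # Map each [contract_id, region, category] key to its supplier IDs.
  suppliers_by_key = {}
  file_1_entries.each do |contract_id, region, category, supplier_id|
    (suppliers_by_key[[contract_id, region, category]] ||= []) << supplier_id
  end

  file_2_io.each_line do |line|
    # Assumed line format: contract_id|region|category|manufacturer|model
    contract_id, region, category, manufacturer, model = line.chomp.split("|")
    suppliers = suppliers_by_key.fetch([contract_id, region, category], [])
    suppliers.each do |supplier_id|
      out_io.puts [supplier_id, region, category, manufacturer, model].join("|")
    end
  end
end

# Made-up sample data; in practice file_2 and out could be File objects.
file_1_entries = [
  ["C1", "East", "Tools", "S1"],
  ["C1", "East", "Tools", "S2"],
]
file_2 = StringIO.new("C1|East|Tools|Acme|M1\nC9|West|Paint|Bolt|M2\n")
out = StringIO.new
stream_join(file_1_entries, file_2, out)
```

Only the index for the smaller file is held in memory at any time. The larger file and the output are each processed one line at a time.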


2 Comments

I think you have understood my question perfectly. I will start working on it and update my question if I get stuck. Thanks!
Your answer was just the help I needed. I was able to get everything working yesterday and wanted to say thanks!
