ruby remove duplicates and modify existing element in nested array by iterating over another array

Question

I have an array doc_count that contains strings with dates (year-month-day), among other things.

I'd like to transform doc_count into goal by removing "duplicates": for each date I'd like to keep the longer date string and remove the shorter one, e.g.

"2019-02-01: 186904" instead of "2019-02-01"

  doc_count = [
    ["foo", "2019-02-01: 186904", "2019-03-01: 196961", "2019-01-01", "2019-02-01", "2019-03-01", "2019-04-01"],
    ["bar", "2019-01-01: 8876", "2019-04-01: 8694", "2019-01-01", "2019-02-01", "2019-03-01", "2019-04-01"]
  ]

  goal = [
    ["foo", "2019-01-01", "2019-02-01: 186904", "2019-03-01: 196961", "2019-04-01"],
    ["bar", "2019-01-01: 8876", "2019-02-01", "2019-03-01", "2019-04-01: 8694"]
  ]

  def string_to_month(month)
    month.match(/^\d{4}-\d{2}-\d{2}/) && month.include?(': ') ?
      month.match(/^\d{4}-\d{2}-\d{2}/)[0] : month
  end

  my_attempt = doc_count.each do |topic|
    topic.each do |el|
      topic.delete(el) if el == string_to_month(el)
    end
  end

For some reason my attempt fails to generate an array identical to goal.

  2.6.3 (main):0 > my_attempt       
  => [
    [0] [
      [0] "foo",
      [1] "2019-02-01: 186904",
      [2] "2019-03-01: 196961",
      [3] "2019-02-01",
      [4] "2019-04-01"
    ],
    [1] [
      [0] "bar",
      [1] "2019-01-01: 8876",
      [2] "2019-04-01: 8694",
      [3] "2019-02-01",
      [4] "2019-04-01"
    ]
  ]

How can I fix this? Thank you very much!

What if "2019-04-01" occurs before "2019-04-01: 8694"? Which one should be removed?
– Pavel Mikhailyuk, Commented Jan 9, 2020 at 14:27
@PavelMikhailyuk The string with a colon should always stay in the array, and the string containing only a date should be removed. The order in the array isn't relevant.
– Sandra Cieseck, Commented Jan 9, 2020 at 14:37
"The order in the array isn't relevant", but the example shows that the goal "lines" are ordered by a compound rule: "foo" at the beginning and then by date.
– Pavel Mikhailyuk, Commented Jan 9, 2020 at 14:57

StandardNerd · Answer · 2020-01-14 08:25:06Z

2

One solution could be a combination of #group_by, Enumerable#flat_map and #max_by.

#flat_map returns a new array with the concatenated results of running the block once for every element, and #max_by returns the element for which the block yields the largest value (here, the longest string). You already used #match to check for the date format, but in this example there's no need to move it into a separate method.

solution = doc_count.map do |topic|
  topic.group_by { |s| s[0..9] }.flat_map do |key, values|
    key.match?(/^\d{4}-\d{2}-\d{2}/) ? [values.max_by(&:size)] : values
  end.sort.rotate!(-1)
end

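Last but not least, #sort and #rotate!(-1) produce the desired order: sorting puts the date strings in ascending order, with the label ("foo" or "bar") last because letters sort after digits, and rotating by -1 moves that last element to the front.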
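A minimal sketch, not part of the original answer. The first helper reproduces on a flat array how calling delete inside each skips elements, which is presumably why the attempt in the question leaves "2019-02-01" and "2019-04-01" behind. The second wraps the pipeline above, which builds new arrays instead of mutating the one being iterated, and compares the result with goal. The names delete_while_iterating and dedupe are made up for illustration.

```ruby
# Hypothetical helper: runs delete inside each on a copy of arr.
# Every deletion shifts the remaining elements one slot to the left,
# so the iteration steps over the element that moved into the freed slot.
def delete_while_iterating(arr)
  copy = arr.dup
  copy.each { |el| copy.delete(el) }
  copy
end

# Hypothetical wrapper around the pipeline above; it returns new arrays
# and leaves doc_count untouched.
def dedupe(doc_count)
  doc_count.map do |topic|
    topic.group_by { |s| s[0..9] }.flat_map do |key, values|
      key.match?(/^\d{4}-\d{2}-\d{2}/) ? [values.max_by(&:size)] : values
    end.sort.rotate(-1)
  end
end

delete_while_iterating(["2019-01-01", "2019-02-01", "2019-03-01", "2019-04-01"])
  #=> ["2019-02-01", "2019-04-01"]

doc_count = [
  ["foo", "2019-02-01: 186904", "2019-03-01: 196961", "2019-01-01", "2019-02-01", "2019-03-01", "2019-04-01"],
  ["bar", "2019-01-01: 8876", "2019-04-01: 8694", "2019-01-01", "2019-02-01", "2019-03-01", "2019-04-01"]
]

goal = [
  ["foo", "2019-01-01", "2019-02-01: 186904", "2019-03-01: 196961", "2019-04-01"],
  ["bar", "2019-01-01: 8876", "2019-02-01", "2019-03-01", "2019-04-01: 8694"]
]

dedupe(doc_count) == goal #=> true
```

Note that the sort-and-rotate step assumes exactly one non-date label per row and that it sorts after the date strings.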

UPDATE: please use Cary Swoveland's solution, it's better and he did an extraordinary job to explain the steps in detail.

edited Jan 14, 2020 at 8:25

answered Jan 9, 2020 at 17:05


3 Comments

Cary Swoveland Over a year ago

Perhaps neither better nor worse, but consider first separating the dates from the non-dates, then non_dates + dates.group_by { |s| s[0, 10] }.values.map { |a| a.max_by(&:size) }.

StandardNerd Over a year ago

@CarySwoveland yes, good point! Emily, did you see Cary Swoveland's solution? Consider changing the accepted answer from mine to Cary Swoveland's. I prefer his solution, and he did an extraordinary job explaining all the necessary steps in detail.

Sandra Cieseck Over a year ago

I just changed the accepted answer to Cary Swoveland's solution, even though I keep using yours: it's much shorter and does the trick.

Cary Swoveland · Accepted Answer · 2020-01-10 09:51:33Z

doc_count = [
  ["foo", "2019-02-01: 186904", "2019-03-01: 196961", "2019-01-01", 
   "2019-02-01", "2019-03-01", "2019-04-01"],
  ["bar", "2019-01-01: 8876", "2019-04-01: 8694", "2019-01-01",
   "2019-02-01", "2019-03-01", "2019-04-01"]
]

We may write

def doit(doc_count)
  doc_count.map do |arr|
    date_strings, other_strings =
      arr.partition { |s| s.match?(/\A\d{4}-\d{2}-\d{2}(?::|\z)/) }
    other_strings + select_dates(date_strings)
  end
end

where select_dates is a method yet to be constructed.

The calculations for doc_count[0] are as follows:

arr = doc_count[0]
  #=> ["foo", "2019-02-01: 186904", "2019-03-01: 196961", "2019-01-01",
  #    "2019-02-01", "2019-03-01", "2019-04-01"] 
date_strings, other_strings =
  arr.partition { |s| s.match?(/\A\d{4}-\d{2}-\d{2}(?::|\z)/) }
  #=> [["2019-02-01: 186904", "2019-03-01: 196961", "2019-01-01",
  #     "2019-02-01", "2019-03-01", "2019-04-01"], ["foo"]] 
date_strings
  #=> ["2019-02-01: 186904", "2019-03-01: 196961", "2019-01-01",
  #    "2019-02-01", "2019-03-01", "2019-04-01"] 
other_strings
  #=> ["foo"]

The calculations for the second element of doc_count are similar. See Enumerable#partition.
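To see which strings the regex sends to date_strings, here is a small sketch. The constant DATE_RE and the sample string "2019-02-01 draft" are made up for illustration. The group (?::|\z) requires the date to be followed either by a colon or by the end of the string, so a label that merely starts with a date stays in other_strings.

```ruby
# Hypothetical constant holding the regex used in the partition above.
DATE_RE = /\A\d{4}-\d{2}-\d{2}(?::|\z)/

samples = ["foo", "2019-02-01: 186904", "2019-02-01", "2019-02-01 draft"]

# partition returns two arrays: elements for which the block is truthy,
# followed by the rest, each in original order.
date_strings, other_strings = samples.partition { |s| s.match?(DATE_RE) }

date_strings  #=> ["2019-02-01: 186904", "2019-02-01"]
other_strings #=> ["foo", "2019-02-01 draft"]
```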

I will now give two ways to construct the method select_dates, the first being the more efficient, the second arguably the more straightforward.

Use the form of Hash#update (aka merge!) that employs a block to determine the values of keys that are present in both hashes being merged

def select_dates(date_strings)
  date_strings.each_with_object({}) do |s,h|
    h.update(s[0, 10]=>s) { |_,o,n| n.size >= o.size ? n : o }
  end.values
end

See the doc for explanations of the block variables _, o and n: _ is the common key, o the value already stored in h and n the value being merged in (_, a valid local variable, is used for the first block variable to tell the reader that it is not used in the block calculation). For date_strings given above for doc_count[0]:

select_dates(date_strings)
  #=> ["2019-02-01: 186904", "2019-03-01: 196961", "2019-01-01",
  #    "2019-04-01"]

The calculations are as follows.

 enum = date_strings.each_with_object({})
   #=> #<Enumerator: ["2019-02-01: 186904", "2019-03-01: 196961",
   #    "2019-01-01", "2019-02-01", "2019-03-01",
   #    "2019-04-01"]:each_with_object({})>

 s,h = enum.next
   #=> ["2019-02-01: 186904", {}] 
 s #=> "2019-02-01: 186904" 
 h #=> {} 
 key = s[0, 10]
   #=> "2019-02-01" 
 h.update(key=>s) { |_,o,n| n.size >= o.size ? n : o }
   #=> {"2019-02-01"=>"2019-02-01: 186904"}

 s,h = enum.next
   #=> ["2019-03-01: 196961", {"2019-02-01"=>"2019-02-01: 186904"}] 
 key = s[0, 10]
   #=> "2019-03-01" 
 h.update(key=>s) { |_,o,n| n.size >= o.size ? n : o }
   #=> {"2019-02-01"=>"2019-02-01: 186904",
   #    "2019-03-01"=>"2019-03-01: 196961"}

 s,h = enum.next
   #=> ["2019-01-01", {"2019-02-01"=>"2019-02-01: 186904",
   #    "2019-03-01"=>"2019-03-01: 196961"}] 
 key = s[0, 10]
   #=> "2019-01-01" 
 h.update(key=>s) { |_,o,n| n.size >= o.size ? n : o }
   #=> {"2019-02-01"=>"2019-02-01: 186904",
   #    "2019-03-01"=>"2019-03-01: 196961", "2019-01-01"=>"2019-01-01"}

 s,h = enum.next
   #=> ["2019-02-01", {"2019-02-01"=>"2019-02-01: 186904",
   #    "2019-03-01"=>"2019-03-01: 196961", "2019-01-01"=>"2019-01-01"}] 
 key = s[0, 10]
   #=> "2019-02-01" 
 h.update(key=>s) { |_,o,n| n.size >= o.size ? n : o }
   #=> {"2019-02-01"=>"2019-02-01: 186904",
   #    "2019-03-01"=>"2019-03-01: 196961", "2019-01-01"=>"2019-01-01"}

For the first three elements of enum that are generated and passed to the block, update's block does not come into play, as the two hashes being merged (h and { key=>s }) do not have a common key. For the fourth element ("2019-02-01"), whose key is present in both hashes being merged, we defer to the block to compare h["2019-02-01"].size #=> 18 (the size of "2019-02-01: 186904") with "2019-02-01".size #=> 10. Since the former is larger we keep it as the value of "2019-02-01" in h. The remaining calculations for update are similar, resulting in:

h #=>  {"2019-02-01"=>"2019-02-01: 186904",
  #     "2019-03-01"=>"2019-03-01: 196961", "2019-01-01"=>"2019-01-01",
  #     "2019-04-01"=>"2019-04-01" }

The final step is to extract the values from this hash (h.values).
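Putting the pieces together, the following sketch assembles doit and the Hash#update version of select_dates exactly as given above and runs them on doc_count. The last line is an addition for illustration only: it sorts the dates of each row to show that the result then equals goal from the question.

```ruby
# select_dates as given above: keep, per date prefix, the longest string.
def select_dates(date_strings)
  date_strings.each_with_object({}) do |s, h|
    # The block runs only when the key already exists in h:
    # o is the stored (old) value, n the incoming (new) one.
    h.update(s[0, 10] => s) { |_, o, n| n.size >= o.size ? n : o }
  end.values
end

def doit(doc_count)
  doc_count.map do |arr|
    date_strings, other_strings =
      arr.partition { |s| s.match?(/\A\d{4}-\d{2}-\d{2}(?::|\z)/) }
    other_strings + select_dates(date_strings)
  end
end

doc_count = [
  ["foo", "2019-02-01: 186904", "2019-03-01: 196961", "2019-01-01",
   "2019-02-01", "2019-03-01", "2019-04-01"],
  ["bar", "2019-01-01: 8876", "2019-04-01: 8694", "2019-01-01",
   "2019-02-01", "2019-03-01", "2019-04-01"]
]

doit(doc_count)
  #=> [["foo", "2019-02-01: 186904", "2019-03-01: 196961", "2019-01-01",
  #     "2019-04-01"],
  #    ["bar", "2019-01-01: 8876", "2019-04-01: 8694", "2019-02-01",
  #     "2019-03-01"]]

# Illustration only: sorting the dates of each row reproduces goal.
doit(doc_count).map { |label, *dates| [label, *dates.sort] }
  #=> [["foo", "2019-01-01", "2019-02-01: 186904", "2019-03-01: 196961",
  #     "2019-04-01"],
  #    ["bar", "2019-01-01: 8876", "2019-02-01", "2019-03-01",
  #     "2019-04-01: 8694"]]
```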

Use Array#uniq

def select_dates(date_strings)
  date_strings.sort_by(&:size).reverse.uniq { |s| s[0, 10] }
end

For date_strings given above for doc_count[0]

select_dates(date_strings)
  #=> ["2019-03-01: 196961", "2019-02-01: 186904", "2019-04-01",
  #    "2019-01-01"]

The calculations are as follows.

a = date_strings.sort_by(&:size)
  #=> ["2019-01-01", "2019-02-01", "2019-03-01", "2019-04-01",
  #    "2019-02-01: 186904", "2019-03-01: 196961"] 
b = a.reverse
  #=> ["2019-03-01: 196961", "2019-02-01: 186904", "2019-04-01",
  #    "2019-03-01", "2019-02-01", "2019-01-01"] 
b.uniq { |s| s[0, 10] }
  #=> ["2019-03-01: 196961", "2019-02-01: 186904", "2019-04-01",
  #    "2019-01-01"]

Note that the doc for Array#uniq states, "self is traversed in order, and the first occurrence is kept." The expression

sort_by(&:size).reverse

could be replaced by

sort_by { |s| -s.size }

but it has been reported that what I've used tends to be faster.
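Readers who want to check the speed claim on their own machine could use a rough sketch like the one below. The helper seconds and the generated sample data are made up for illustration, and the timings depend on the Ruby version and the input, so no particular outcome is implied.

```ruby
# Hypothetical helper: wall-clock seconds taken by the block.
def seconds
  start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  yield
  Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
end

# Hypothetical sample data: 50,000 date strings of varying length.
sample = Array.new(50_000) { |i| i.even? ? "2019-01-01" : "2019-01-01: #{i}" }

a = sample.sort_by(&:size).reverse
b = sample.sort_by { |s| -s.size }

# Both put the longest strings first; strings of equal size may come out
# in a different order, because sort_by is not guaranteed to be stable.
a.map(&:size) == b.map(&:size) #=> true

puts "sort_by(&:size).reverse: #{seconds { sample.sort_by(&:size).reverse }}"
puts "sort_by { |s| -s.size }: #{seconds { sample.sort_by { |s| -s.size } }}"
```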
