I have an array doc_count whose rows contain strings with dates (year-month-day), among other things.

I'd like to transform doc_count into goal by removing "duplicates": whenever a date appears both as a longer string and as a bare date string, I'd like to keep the longer one and remove the short one, e.g.

"2019-02-01: 186904" instead of "2019-02-01"

  doc_count = [
    ["foo", "2019-02-01: 186904", "2019-03-01: 196961", "2019-01-01", "2019-02-01", "2019-03-01", "2019-04-01"],
    ["bar", "2019-01-01: 8876", "2019-04-01: 8694", "2019-01-01", "2019-02-01", "2019-03-01", "2019-04-01"]
  ]

  goal = [
    ["foo", "2019-01-01", "2019-02-01: 186904", "2019-03-01: 196961", "2019-04-01"],
    ["bar", "2019-01-01: 8876", "2019-02-01", "2019-03-01", "2019-04-01: 8694"]
  ]

  def string_to_month(month)
    month.match(/^\d{4}-\d{2}-\d{2}/) && month.include?(': ') ?
      month.match(/^\d{4}-\d{2}-\d{2}/)[0] : month
  end

  my_attempt = doc_count.each do |topic|
    topic.each do |el|
      topic.delete(el) if el == string_to_month(el)
    end
  end

For some reason my attempt fails to generate an array identical to goal.

  2.6.3 (main):0 > my_attempt       
  => [
    [0] [
      [0] "foo",
      [1] "2019-02-01: 186904",
      [2] "2019-03-01: 196961",
      [3] "2019-02-01",
      [4] "2019-04-01"
    ],
    [1] [
      [0] "bar",
      [1] "2019-01-01: 8876",
      [2] "2019-04-01: 8694",
      [3] "2019-02-01",
      [4] "2019-04-01"
    ]
  ]

How can I fix this? Thank you very much!

  • What if "2019-04-01" occurs before "2019-04-01: 8694"? Which one should be removed? Commented Jan 9, 2020 at 14:27
  • @PavelMikhailyuk string with a colon should always stay in array and the string only with date should be removed. the order in array isn't relevant. Commented Jan 9, 2020 at 14:37
  • "the order in array isn't relevant" but example shows, that goal "lines" are ordered by complex rule: "foo" at the begin and then by date. Commented Jan 9, 2020 at 14:57

2 Answers

2

One solution could be a combination of Enumerable#group_by, #flat_map and #max_by.

#flat_map returns a new array with the concatenated results of running the block once for every element, and #max_by returns the element for which the block yields the largest value (here, the longest string in each group). You already used #match to check for the date format, but in this example there's no need to move it into a separate method.

solution = doc_count.map do |topic|
  topic.group_by { |s| s[0..9] }.flat_map do |key, values|
    key.match?(/^\d{4}-\d{2}-\d{2}/) ? [values.max_by(&:size)] : values
  end.sort.rotate!(-1)
end

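Last but not least, #sort and #rotate!(-1) produce the desired order: sorting puts the dates in ascending order with "foo" (or "bar") last, and rotating by -1 moves that last element to the front.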
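To make the chain easier to follow, here is a sketch that splits it into named steps for the first row of doc_count. The intermediate names (groups, picked, sorted, result) are only for illustration and are not part of the solution above. Note the assumption the last step relies on: each row holds exactly one non-date label, and it sorts after the dates (lowercase letters compare greater than digits).

```ruby
topic = ["foo", "2019-02-01: 186904", "2019-03-01: 196961", "2019-01-01",
         "2019-02-01", "2019-03-01", "2019-04-01"]

# 1. Group strings by their first ten characters (the date part, or the
#    whole string when it is shorter, as with "foo").
groups = topic.group_by { |s| s[0..9] }

# 2. For date keys keep only the longest string; pass other groups through.
picked = groups.flat_map do |key, values|
  key.match?(/^\d{4}-\d{2}-\d{2}/) ? [values.max_by(&:size)] : values
end

# 3. Sort lexicographically: dates come first, the label ends up last.
sorted = picked.sort

# 4. Move the last element (the label) to the front.
result = sorted.rotate(-1)

p groups
p picked
p result
```

If a row could contain several labels, or a label starting with a digit, separating labels from dates explicitly would be safer than sort plus rotate.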

UPDATE: please use Cary Swoveland's solution, it's better and he did an extraordinary job to explain the steps in detail.


3 Comments

Perhaps neither better nor worse, but consider first separating the dates from the non-dates, then non_dates + dates.group_by(&:size).values.map { |a| a.max_by(&:size) }.
@CarySwoveland yes, good point! Emily Did you see Cary Swovelands solution? Consider changing the accepted answer from mine to Cary Swoveland's. I'd prefer his solution and he did an extraordinary job explaining all the necessary steps in detail.
I just changed the accepted answer to Cary Swoveland's solution even though I keep using your solution, it's much shorter and does the trick.
1
doc_count = [
  ["foo", "2019-02-01: 186904", "2019-03-01: 196961", "2019-01-01", 
   "2019-02-01", "2019-03-01", "2019-04-01"],
  ["bar", "2019-01-01: 8876", "2019-04-01: 8694", "2019-01-01",
   "2019-02-01", "2019-03-01", "2019-04-01"]
]

We may write

def doit(doc_count)
  doc_count.map do |arr|
    date_strings, other_strings =
      arr.partition { |s| s.match?(/\A\d{4}-\d{2}-\d{2}(?::|\z)/) }
    other_strings + select_dates(date_strings)
  end
end

where select_dates is a method yet to be constructed.

The calculations for doc_count[0] are as follows:

arr = doc_count[0]
  #=> ["foo", "2019-02-01: 186904", "2019-03-01: 196961", "2019-01-01",
  #    "2019-02-01", "2019-03-01", "2019-04-01"] 
date_strings, other_strings =
  arr.partition { |s| s.match?(/\A\d{4}-\d{2}-\d{2}(?::|\z)/) }
  #=> [["2019-02-01: 186904", "2019-03-01: 196961", "2019-01-01",
  #     "2019-02-01", "2019-03-01", "2019-04-01"], ["foo"]] 
date_strings
  #=> ["2019-02-01: 186904", "2019-03-01: 196961", "2019-01-01",
  #    "2019-02-01", "2019-03-01", "2019-04-01"] 
other_strings
  #=> ["foo"] 

The calculations for the second element of doc_count are similar. See Enumerable#partition.
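As a quick check of what the regular expression in partition accepts, the sketch below runs it against a few strings. The constant name DATE_ENTRY and the sample strings are made up for illustration. The group (?::|\z) requires the date to be followed either by a colon or by the end of the string, so a date followed by other text is treated as a non-date.

```ruby
DATE_ENTRY = /\A\d{4}-\d{2}-\d{2}(?::|\z)/

samples = ["2019-01-01", "2019-01-01: 8876", "2019-01-01x", "foo", "x 2019-01-01"]

# Pair each sample with whether it counts as a date entry.
results = samples.map { |s| s.match?(DATE_ENTRY) }
samples.zip(results).each { |s, ok| puts "#{s.inspect} => #{ok}" }

# A pattern anchored only at the start is looser: it also accepts the
# string with trailing text.
loose = "2019-01-01x".match?(/^\d{4}-\d{2}-\d{2}/)
puts loose
```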

I will now give two ways to construct the method select_dates, the first being the more efficient, the second arguably the more straightforward.

Use the form of Hash#update (aka merge!) that employs a block to determine the values of keys that are present in both hashes being merged

def select_dates(date_strings)
  date_strings.each_with_object({}) do |s,h|
    h.update(s[0, 10]=>s) { |_,o,n| n.size >= o.size ? n : o }
  end.values
end
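Before walking through select_dates, it may help to see the block form of Hash#update in isolation. This is a minimal sketch with made-up data (totals, conflicts); it records which keys reach the block to show that the block runs only for keys present in both hashes.

```ruby
totals = { "a" => 1, "b" => 2 }
conflicts = []

# The block receives the key, the value already in the receiver, and the
# incoming value; its return value becomes the stored value.
totals.update({ "b" => 5, "c" => 7 }) do |key, old_value, new_value|
  conflicts << key
  [old_value, new_value].max
end

p totals     # "c" is simply added, "b" is resolved by the block
p conflicts  # only the shared key reached the block
```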

See the doc for explanations of the block variables _, o and n: they are the common key, the value already in h (old) and the value being merged in (new). (_--a valid local variable--is used for the first block variable to tell the reader that it is not used in the block calculation.) For date_strings given above for doc_count[0]

select_dates(date_strings)
  #=> ["2019-02-01: 186904", "2019-03-01: 196961", "2019-01-01",
  #    "2019-04-01"] 

The calculations are as follows.

 enum = date_strings.each_with_object({})
   #=> #<Enumerator: ["2019-02-01: 186904", "2019-03-01: 196961",
   #   "2019-01-01", "2019-02-01", "2019-03-01", "2019-04-01"
   #                 ]:each_with_object({})> 

 s,h = enum.next
   #=> ["2019-02-01: 186904", {}] 
 s #=> "2019-02-01: 186904" 
 h #=> {} 
 key = s[0, 10]
   #=> "2019-02-01" 
 h.update(key=>s) { |_,o,n| n.size >= o.size ? n : o }
   #=> {"2019-02-01"=>"2019-02-01: 186904"} 

 s,h = enum.next
   #=> ["2019-03-01: 196961", {"2019-02-01"=>"2019-02-01: 186904"}] 
 key = s[0, 10]
   #=> "2019-03-01" 
 h.update(key=>s) { |_,o,n| n.size >= o.size ? n : o }
   #=> {"2019-02-01"=>"2019-02-01: 186904",
   #    "2019-03-01"=>"2019-03-01: 196961"} 

 s,h = enum.next
   #=> ["2019-01-01", {"2019-02-01"=>"2019-02-01: 186904",
   #    "2019-03-01"=>"2019-03-01: 196961"}] 
 key = s[0, 10]
   #=> "2019-01-01" 
 h.update(key=>s) { |_,o,n| n.size >= o.size ? n : o }
   #=> {"2019-02-01"=>"2019-02-01: 186904",
   #    "2019-03-01"=>"2019-03-01: 196961", "2019-01-01"=>"2019-01-01"}

 s,h = enum.next
   #=> ["2019-02-01", {"2019-02-01"=>"2019-02-01: 186904",
   #    "2019-03-01"=>"2019-03-01: 196961", "2019-01-01"=>"2019-01-01"}] 
 key = s[0, 10]
   #=> "2019-02-01" 
 h.update(key=>s) { |_,o,n| n.size >= o.size ? n : o }
   #=> {"2019-02-01"=>"2019-02-01: 186904",
   #    "2019-03-01"=>"2019-03-01: 196961", "2019-01-01"=>"2019-01-01"} 

For the first three elements of enum that are generated and passed to the block, update's block does not come into play, as the two hashes being merged (h and { key=>s }) do not have a common key. For the fourth element ("2019-02-01"), the key is present in both hashes being merged, so we defer to the block to compare h["2019-02-01"].size (that is, "2019-02-01: 186904".size #=> 18) with "2019-02-01".size #=> 10. Since the former is larger we keep it as the value of "2019-02-01" in h. The remaining calculations for update are similar, resulting in:

h #=> {"2019-02-01"=>"2019-02-01: 186904",
  #    "2019-03-01"=>"2019-03-01: 196961", "2019-01-01"=>"2019-01-01",
  #    "2019-04-01"=>"2019-04-01"}

The final step is to extract the values from this hash (h.values).
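Putting the pieces together, the sketch below combines doit with this version of select_dates and compares the result with goal from the question. One deviation is mine and not part of the method above: the selected dates are sorted, purely so that the output matches the order shown in goal. Without .sort the same strings are returned in hash insertion order.

```ruby
def select_dates(date_strings)
  date_strings.each_with_object({}) do |s, h|
    # Keep the longer of the two strings that share a date prefix.
    h.update(s[0, 10] => s) { |_, o, n| n.size >= o.size ? n : o }
  end.values
end

def doit(doc_count)
  doc_count.map do |arr|
    date_strings, other_strings =
      arr.partition { |s| s.match?(/\A\d{4}-\d{2}-\d{2}(?::|\z)/) }
    # .sort is an addition for this sketch, to reproduce goal's ordering.
    other_strings + select_dates(date_strings).sort
  end
end

doc_count = [
  ["foo", "2019-02-01: 186904", "2019-03-01: 196961", "2019-01-01",
   "2019-02-01", "2019-03-01", "2019-04-01"],
  ["bar", "2019-01-01: 8876", "2019-04-01: 8694", "2019-01-01",
   "2019-02-01", "2019-03-01", "2019-04-01"]
]

goal = [
  ["foo", "2019-01-01", "2019-02-01: 186904", "2019-03-01: 196961", "2019-04-01"],
  ["bar", "2019-01-01: 8876", "2019-02-01", "2019-03-01", "2019-04-01: 8694"]
]

puts doit(doc_count) == goal
```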

Use Array#uniq

def select_dates(date_strings)
  date_strings.sort_by(&:size).reverse.uniq { |s| s[0, 10] }
end

For date_strings given above for doc_count[0]

select_dates(date_strings)
  #=> ["2019-03-01: 196961", "2019-02-01: 186904", "2019-04-01",
  #    "2019-01-01"] 

The calculations are as follows.

a = date_strings.sort_by(&:size)
  #=> ["2019-01-01", "2019-02-01", "2019-03-01", "2019-04-01",
  #    "2019-02-01: 186904", "2019-03-01: 196961"] 
b = a.reverse
  #=> ["2019-03-01: 196961", "2019-02-01: 186904", "2019-04-01",
  #    "2019-03-01", "2019-02-01", "2019-01-01"] 
b.uniq { |s| s[0, 10] }
  #=> ["2019-03-01: 196961", "2019-02-01: 186904", "2019-04-01",
  #    "2019-01-01"] 

Note that the doc for Array#uniq states, "self is traversed in order, and the first occurrence is kept.". The expression

sort_by(&:size).reverse

could be replaced by

sort_by { |s| -s.size }

but it has been reported that what I've used tends to be faster.
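As a side note on why the attempt in the question returns a partial result: it most likely comes from calling delete on the array while each is still iterating over it. After a deletion the remaining elements shift left, so the element that moves into the current position is never yielded. The sketch below reproduces the effect with made-up data (letters, visited, kept) and shows a non-mutating alternative.

```ruby
letters = %w[a b c d]
visited = []

letters.each do |el|
  visited << el
  # Deleting shifts "c" into the slot just visited, so it is skipped.
  letters.delete(el) if el == "b"
end

p visited
p letters

# Building a new array avoids mutating the receiver during iteration.
kept = %w[a b c d].reject { |el| el == "b" }
p kept
```

Both approaches in this answer sidestep the problem because they build new arrays (partition, values, uniq) instead of deleting in place.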
