doc_count = [
  ["foo", "2019-02-01: 186904", "2019-03-01: 196961", "2019-01-01",
   "2019-02-01", "2019-03-01", "2019-04-01"],
  ["bar", "2019-01-01: 8876", "2019-04-01: 8694", "2019-01-01",
   "2019-02-01", "2019-03-01", "2019-04-01"]
]
We may write
def doit(doc_count)
  doc_count.map do |arr|
    date_strings, other_strings =
      arr.partition { |s| s.match?(/\A\d{4}-\d{2}-\d{2}(?::|\z)/) }
    other_strings + select_dates(date_strings)
  end
end
where select_dates is a method yet to be constructed.
The calculations for doc_count[0] are as follows:
arr = doc_count[0]
#=> ["foo", "2019-02-01: 186904", "2019-03-01: 196961", "2019-01-01",
# "2019-02-01", "2019-03-01", "2019-04-01"]
date_strings, other_strings =
  arr.partition { |s| s.match?(/\A\d{4}-\d{2}-\d{2}(?::|\z)/) }
#=> [["2019-02-01: 186904", "2019-03-01: 196961", "2019-01-01",
# "2019-02-01", "2019-03-01", "2019-04-01"], ["foo"]]
date_strings
#=> ["2019-02-01: 186904", "2019-03-01: 196961", "2019-01-01",
# "2019-02-01", "2019-03-01", "2019-04-01"]
other_strings
#=> ["foo"]
The calculations for the second element of doc_count are similar. See Enumerable#partition.
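To see which strings the regular expression treats as dates, here is a small sketch. The constant name DATE_RE and the sample strings are made up for illustration; only the pattern itself comes from doit above.

```ruby
# The pattern from doit: a date at the very start of the string, followed
# either by a colon or by the end of the string.
DATE_RE = /\A\d{4}-\d{2}-\d{2}(?::|\z)/ # hypothetical constant name

# Hypothetical samples, not taken from doc_count.
samples = ["foo", "2019-02-01: 186904", "2019-01-01", "2019-01-01x", "x 2019-01-01"]

# partition returns two arrays: elements for which the block is truthy,
# then the remaining elements, each in their original order.
matches, rest = samples.partition { |s| s.match?(DATE_RE) }

matches #=> ["2019-02-01: 186904", "2019-01-01"]
rest    #=> ["foo", "2019-01-01x", "x 2019-01-01"]
```

A date followed by anything other than a colon, or preceded by other text, lands in the second array.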
I will now give two ways to construct the method select_dates, the first being the more efficient, the second arguably the more straightforward.
First way: use the form of Hash#update (aka merge!) that employs a block to determine the values of keys that are present in both hashes being merged.
def select_dates(date_strings)
  date_strings.each_with_object({}) do |s, h|
    h.update(s[0, 10] => s) { |_, o, n| n.size >= o.size ? n : o }
  end.values
end
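The block form of Hash#update may be easier to follow in isolation. The following is a minimal sketch using made-up keys and values; the block is the same one used in select_dates.

```ruby
h = { "a" => "x", "b" => "yyy" } # made-up data

# "c" is not yet a key of h, so the block is not called.
h.update("c" => "zz") { |_, o, n| n.size >= o.size ? n : o }
h["c"] #=> "zz"

# "a" is a key of both hashes: o is "x" (old), n is "xx" (new).
# n is at least as long as o, so n becomes the value.
h.update("a" => "xx") { |_, o, n| n.size >= o.size ? n : o }
h["a"] #=> "xx"

# "b" is a key of both hashes: o is "yyy", n is "y".
# o is longer, so the existing value is kept.
h.update("b" => "y") { |_, o, n| n.size >= o.size ? n : o }
h["b"] #=> "yyy"

# Existing keys keep their original position in the hash.
h.keys #=> ["a", "b", "c"]
```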
See the doc for explanations of the block variables _, o and n (_--a valid local variable--is used for the first block variable to tell the reader that it is not used in the block calculation). For date_strings given above for doc_count[0]
select_dates(date_strings)
#=> ["2019-02-01: 186904", "2019-03-01: 196961", "2019-01-01",
# "2019-04-01"]
The calculations are as follows.
enum = date_strings.each_with_object({})
#=> #<Enumerator: ["2019-02-01: 186904", "2019-03-01: 196961",
# "2019-01-01", "2019-02-01", "2019-03-01", "2019-04-01"
# ]:each_with_object({})>
s,h = enum.next
#=> ["2019-02-01: 186904", {}]
s #=> "2019-02-01: 186904"
h #=> {}
key = s[0, 10]
#=> "2019-02-01"
h.update(key=>s) { |_,o,n| n.size >= o.size ? n : o }
#=> {"2019-02-01"=>"2019-02-01: 186904"}
s,h = enum.next
#=> ["2019-03-01: 196961", {"2019-02-01"=>"2019-02-01: 186904"}]
key = s[0, 10]
#=> "2019-03-01"
h.update(key=>s) { |_,o,n| n.size >= o.size ? n : o }
#=> {"2019-02-01"=>"2019-02-01: 186904",
# "2019-03-01"=>"2019-03-01: 196961"}
s,h = enum.next
#=> ["2019-01-01", {"2019-02-01"=>"2019-02-01: 186904",
# "2019-03-01"=>"2019-03-01: 196961"}]
key = s[0, 10]
#=> "2019-01-01"
h.update(key=>s) { |_,o,n| n.size >= o.size ? n : o }
#=> {"2019-02-01"=>"2019-02-01: 186904",
# "2019-03-01"=>"2019-03-01: 196961", "2019-01-01"=>"2019-01-01"}
s,h = enum.next
#=> ["2019-02-01", {"2019-02-01"=>"2019-02-01: 186904",
# "2019-03-01"=>"2019-03-01: 196961", "2019-01-01"=>"2019-01-01"}]
key = s[0, 10]
#=> "2019-02-01"
h.update(key=>s) { |_,o,n| n.size >= o.size ? n : o }
#=> {"2019-02-01"=>"2019-02-01: 186904",
# "2019-03-01"=>"2019-03-01: 196961", "2019-01-01"=>"2019-01-01"}
For the first three elements of enum that are generated and passed to the block, update's block does not come into play, as the two hashes being merged (h and { key=>s }) do not have a common key. For the fourth element ("2019-02-01"), the key is present in both hashes being merged, so we defer to the block to compare h["2019-02-01"].size #=> "2019-02-01: 186904".size #=> 18 with "2019-02-01".size #=> 10. Since the former is larger we keep it as the value of "2019-02-01" in h. The remaining calculations for update are similar, resulting in:
h #=> {"2019-02-01"=>"2019-02-01: 186904",
# "2019-03-01"=>"2019-03-01: 196961", "2019-01-01"=>"2019-01-01",
# "2019-04-01"=>"2019-04-01"}
The final step is to extract the values from this hash (h.values).
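Putting the pieces together, the following sketch assembles doit with this first version of select_dates and runs it on both elements of doc_count. Nothing new is introduced except the variable name result.

```ruby
def select_dates(date_strings)
  date_strings.each_with_object({}) do |s, h|
    # Key on the 10-character date; on a collision keep the longer string.
    h.update(s[0, 10] => s) { |_, o, n| n.size >= o.size ? n : o }
  end.values
end

def doit(doc_count)
  doc_count.map do |arr|
    date_strings, other_strings =
      arr.partition { |s| s.match?(/\A\d{4}-\d{2}-\d{2}(?::|\z)/) }
    other_strings + select_dates(date_strings)
  end
end

doc_count = [
  ["foo", "2019-02-01: 186904", "2019-03-01: 196961", "2019-01-01",
   "2019-02-01", "2019-03-01", "2019-04-01"],
  ["bar", "2019-01-01: 8876", "2019-04-01: 8694", "2019-01-01",
   "2019-02-01", "2019-03-01", "2019-04-01"]
]

result = doit(doc_count)
result[0]
  #=> ["foo", "2019-02-01: 186904", "2019-03-01: 196961", "2019-01-01",
  #    "2019-04-01"]
result[1]
  #=> ["bar", "2019-01-01: 8876", "2019-04-01: 8694", "2019-02-01",
  #    "2019-03-01"]
```

Note that each date appears in the position where it was first encountered, not in chronological order.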
Second way: use Array#uniq.
def select_dates(date_strings)
  date_strings.sort_by(&:size).reverse.uniq { |s| s[0, 10] }
end
For date_strings given above for doc_count[0]
select_dates(date_strings)
#=> ["2019-03-01: 196961", "2019-02-01: 186904", "2019-04-01",
# "2019-01-01"]
The calculations are as follows.
a = date_strings.sort_by(&:size)
#=> ["2019-01-01", "2019-02-01", "2019-03-01", "2019-04-01",
# "2019-02-01: 186904", "2019-03-01: 196961"]
b = a.reverse
#=> ["2019-03-01: 196961", "2019-02-01: 186904", "2019-04-01",
# "2019-03-01", "2019-02-01", "2019-01-01"]
b.uniq { |s| s[0, 10] }
#=> ["2019-03-01: 196961", "2019-02-01: 186904", "2019-04-01",
# "2019-01-01"]
Note that the doc for Array#uniq states, "self is traversed in order, and the first occurrence is kept." The expression
sort_by(&:size).reverse
could be replaced by
sort_by { |s| -s.size }
but it has been reported that what I've used tends to be faster.
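As a quick check that the two sort expressions select the same strings, here is a small sketch using date_strings from doc_count[0]. The variable names by_reverse and by_negation are made up. Since sort_by is not guaranteed to be stable, the relative order of strings of equal length may differ between the two, so the results are compared after sorting.

```ruby
date_strings = ["2019-02-01: 186904", "2019-03-01: 196961", "2019-01-01",
                "2019-02-01", "2019-03-01", "2019-04-01"]

# Longest strings first, then keep the first string seen for each date.
by_reverse  = date_strings.sort_by(&:size).reverse.uniq { |s| s[0, 10] }
by_negation = date_strings.sort_by { |s| -s.size }.uniq { |s| s[0, 10] }

# Compare the selections independently of their order.
by_reverse.sort == by_negation.sort #=> true
by_reverse.sort
  #=> ["2019-01-01", "2019-02-01: 186904", "2019-03-01: 196961",
  #    "2019-04-01"]
```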
"2019-04-01"occurs before"2019-04-01: 8694"? Which one should be removed?goal"lines" are ordered by complex rule: "foo" at the begin and then by date.