I can very easily remove duplicates in an array by using .uniq, but how would I go about doing it without using the .uniq method?
-
Could you add the reason you want to know? Just curious?Schwern– Schwern2015-12-26 20:21:10 +00:00Commented Dec 26, 2015 at 20:21
-
1Hey there! I'm currently learning Ruby/Web dev stuff, so I have some challenges I'm going through and this one is stumping me like crazy.Dylan Bailey– Dylan Bailey2015-12-26 20:22:19 +00:00Commented Dec 26, 2015 at 20:22
-
It doesn't remove uniques. It retains the uniques.sawa– sawa2015-12-26 20:45:33 +00:00Commented Dec 26, 2015 at 20:45
-
Why the rush to select an answer?Cary Swoveland– Cary Swoveland2015-12-26 21:34:28 +00:00Commented Dec 26, 2015 at 21:34
5 Answers
a = [1, 1, 1, 2, 4, 3, 4, 3, 2, 5, 5, 6]
class Array
def my_uniq
self | []
end
end
a.my_uniq
#=> [1, 2, 4, 3, 5, 6]
This uses the method Array#|: "Set Union — Returns a new array by joining ary with other_ary, excluding any duplicates and preserving the order from the original array."
Here is a benchmark for the various answers, as well as Array#uniq.
require 'fruity'
require 'set'
def doit(n, m)
arr = n.times.to_a
arr = m.times.map { arr.sample }
compare do
uniq { arr.uniq }
Schwern { uniq = []; arr.sort.each { |e| uniq.push(e) if e != uniq[-1]; uniq } }
Sharma {b = []; arr.each{ |aa| b << aa unless b.include?(aa) }; b }
Mihael { arr.to_set.to_a }
sawa { arr.group_by(&:itself).keys }
Cary { arr | [] }
end
end
doit(1_000, 500)
# Schwern is faster than uniq by 19.999999999999996% ± 10.0% (results differ)
# uniq is similar to Cary
# Cary is faster than Mihael by 10.000000000000009% ± 10.0%
# Mihael is similar to sawa
# sawa is faster than Sharma by 5x ± 0.1
doit(100_000, 50_000)
# Schwern is faster than uniq by 50.0% ± 10.0% (results differ)
# uniq is similar to Cary
# Cary is similar to Mihael
# Mihael is faster than sawa by 10.000000000000009% ± 10.0%
# sawa is faster than Sharma by 310x ± 10.0
"Schwern" and "uniq" return arrays containing the same elements but not in the same order (hence "results differ").
Here is the additional benchmark requested by @Schern.
def doit1(n)
arr = n.times.map { rand(n/10) }
compare do
uniq { arr.uniq }
Schwern { uniq = []; arr.sort.each { |e| uniq.push(e) if e != uniq[-1]; uniq } }
Sharma {b = []; arr.each{ |aa| b << aa unless b.include?(aa) }; b }
Mihael { arr.to_set.to_a }
sawa { arr.group_by(&:itself).keys }
Cary { arr | [] }
end
end
doit1(1_000)
# Cary is similar to uniq
# uniq is faster than sawa by 3x ± 1.0
# sawa is similar to Schwern (results differ)
# Schwern is similar to Mihael (results differ)
# Mihael is faster than Sharma by 2x ± 0.1
doit1(50_000)
# Cary is similar to uniq
# uniq is faster than Schwern by 2x ± 1.0 (results differ)
# Schwern is similar to Mihael (results differ)
# Mihael is similar to sawa
# sawa is faster than Sharma by 62x ± 10.0
4 Comments
Array#uniq. Array#| puts the first array in a hash, loops through the elements of the second array adding them if they're not already there, and then returns the keys of the hash. Since the second array is empty, this just puts self into a hash and returns the keys, just like Array#uniq.arr = m.times.map { rand(m/10) }.The code for most Ruby methods can be found in the ruby-doc.org API documentation. If you mouse over a method's documentation, a "click to toggle source" button appears. The code is in C, but it's very easy to understand.
if (RARRAY_LEN(ary) <= 1)
return rb_ary_dup(ary);
if (rb_block_given_p()) {
hash = ary_make_hash_by(ary);
uniq = rb_hash_values(hash);
}
else {
hash = ary_make_hash(ary);
uniq = rb_hash_values(hash);
}
If there's one element, return it. Otherwise turn the elements into hash keys, turn the hash back into an array. By a documented quirk of Ruby hashes, "Hashes enumerate their values in the order that the corresponding keys were inserted", this technique preserves the original order of the elements in the Array. In other languages it may not.
Alternatively, use a Set. A set will never have duplicates. Loading set adds the method to_set to all Enumerable objects, which includes Arrays. However, a Set is usually implemented as a Hash so you're doing the same thing. If you want a unique array, and if you don't need the elements to be ordered, you should probably instead make a set and use that. unique = array.to_set
Alternatively, sort the Array and loop through it pushing each element onto a new Array. If the last element on the new Array matches the current element, discard it.
array = [2, 3, 4, 5, 1, 2, 4, 5];
uniq = []
# This copies the whole array and the duplicates, wasting
# memory. And sort is O(nlogn).
array.sort.each { |e|
uniq.push(e) if e != uniq[-1]
}
[1, 2, 3, 4, 5]
puts uniq.inspect
This method is to be avoided because it is slower and takes more memory than the other methods. The sort makes it slower. Sorting is O(nlogn) meaning as the array gets bigger sorting will get slower quicker than the array grows. It also requires you to copy the whole array, with duplicates, unless you want to alter the original data by sorting in place with sort!.
The other methods are O(n) speed and O(n) memory meaning they will scale linearly as the array gets bigger. And they don't have to copy the duplicates which can use substantially less memory.
8 Comments
Array#uniq implementation takes advantage of that. Using a Set will lose the order, but I noted that caveat. Sorting will also lose the original order, but that's a bad way to do it anyway.require 'set'; a = 0.upto(100_000).to_a; b = 50_000.times.map { a.sample }; b.to_set.to_a == b.uniq #=> true.Set#to_a documents "the order of elements is uncertain." As for the speed of the sort technique, it's O(nlogn). While it may edge out the hash implementation for small arrays, it rapidly gets worse as the array grows. At 10 elements Array#uniq and sort are the same speed. At 100 sort is already 3 times slower. At 1000 it's 6 times slower. Using a set is also slow. pastebin.com/2QaLc6JnYou can use #to_set Read more about it here
2 Comments
Set.new [1,2,3,3,3,4,4], This returns a Set of unique objects.array.group_by(&:itself).keys
......................
2 Comments
You can also try this, check following example.
a = [1, 1, 1, 2, 4, 3, 4, 3, 2, 5, 5, 6]
b = []
a.each{ |aa| b << aa unless b.include?(aa) }
# when you check b you will get following result.
[1, 2, 4, 3, 5, 6]
alternatively you can also try following
a = [1, 1, 1, 2, 4, 3, 4, 3, 2, 5, 5, 6]
b = a & a
# OR
b = a | a
# both will return following result
[1, 2, 4, 3, 5, 6]
3 Comments
b.include?(aa). That has to potentially search all of b every time through the loop. A loop with an O(n) operation inside it is O(n^2). This is why it's better for b to be a Hash. Hash lookups are O(1) and they cannot have duplicates. Rule of thumb: any time you write array.include? ask yourself "can I use a Hash instead?"include? is indeed wasteful. You should ask yourself one of two questions. One is given by @Schwern. The other is: "Should I instead create a set and then convert it to an array?O(n) solutions. If you're just going to throw away the set, I'd use a hash. If you're going to hold onto the unique array, I'd not bother with the array and instead make and use a set.