
I am trying to analyze some documents and find similarities in them. After the analysis, I have an array whose elements are arrays of data from documents considered similar. But sometimes I have two nearly identical elements, and naturally I want to keep only the larger of them. For simplification:

data = [[1,2,3,4,5,6], [7,8,9,10], [1,2,3,5,6]...]

How do I efficiently process the data so that I get:

data = [[1,2,3,4,5,6], [7,8,9,10]...]

I suppose I could intersect every pair of arrays, and if the intersection matches one of the original arrays, ignore that array. Here is some quick code I wrote:

data = [[1,2,3,4,5,6], [7,8,9,10], [1,2,3,5,6], [7,9,10]]
cleaned = []

data.each_index do |i|
  similar = false
  data.each_index do |j|
    if i == j
      next
    elsif (data[i] & data[j]) == data[i]
      # data[i] is fully contained in data[j], so it can be dropped
      similar = true
      break
    end
  end
  cleaned << data[i] unless similar
end

puts cleaned.inspect
# prints [[1, 2, 3, 4, 5, 6], [7, 8, 9, 10]]

Is this an efficient way to go? Also, the current behaviour only removes arrays that are a few elements short of another; I might also want to merge similar arrays when they occur:

[[1,2,3,4,5], [1,3,4,5,6]] => [[1,2,3,4,5,6]]
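
Something along these lines is roughly what I have in mind for the merge case (an untested sketch; the 0.8 overlap ratio and the overlap_ratio helper are placeholders I made up, not settled choices):

# Merge an array into an existing result group when their overlap ratio
# reaches the threshold; otherwise keep it as its own group.
MERGE_THRESHOLD = 0.8

def overlap_ratio(a, b)
  (a & b).size.to_f / [a.size, b.size].min
end

def merge_similar(arrays, threshold = MERGE_THRESHOLD)
  result = []
  arrays.each do |arr|
    existing = result.find { |r| overlap_ratio(r, arr) >= threshold }
    if existing
      existing.replace(existing | arr)   # union keeps each element once
    else
      result << arr.dup
    end
  end
  result
end

merge_similar([[1,2,3,4,5], [1,3,4,5,6]])
# => [[1, 2, 3, 4, 5, 6]]
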
4 Comments
  • The last part of your question requires you to define "similar enough to merge" - for example, how many elements (or what ratio of them) need to match before you want to merge? Commented Jun 18, 2014 at 10:17
  • @NeilSlater, I haven't thought about it yet and haven't analysed much data to see how often this happens. From one batch of documents I noticed two similar arrays of data, which differed by one element ([same_stuff, x], [same_stuff, y]). I suppose the difference might be slightly bigger. Commented Jun 18, 2014 at 10:57
  • -1 The question is not well defined. Commented Jun 18, 2014 at 11:44
  • @sawa what exactly is not well defined? You can ask for clarification if I explained something that is hard to understand. Commented Jun 18, 2014 at 12:19

1 Answer


You can delete any element in the list if it is fully contained in another element:

data.delete_if do |arr|
  data.any? { |a2| !a2.equal?(arr) && arr - a2 == [] }
end
# => [[1, 2, 3, 4, 5, 6], [7, 8, 9, 10]]

This is a bit more efficient than your suggestion, since once you decide that an element should be removed, you don't check against it in later iterations.
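
If the inner arrays get large, a possible variation (just an untested sketch, not part of the answer above) is to precompute a Set for each array so each containment check becomes set lookups instead of rebuilding array differences every time:

require 'set'

# One Set per array, keyed by object identity so value-equal arrays
# still get their own entry.
sets = {}.compare_by_identity
data.each { |arr| sets[arr] = Set.new(arr) }

data.delete_if do |arr|
  data.any? { |a2| !a2.equal?(arr) && sets[arr].subset?(sets[a2]) }
end
# => [[1, 2, 3, 4, 5, 6], [7, 8, 9, 10]]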


6 Comments

  • Great, this is indeed more efficient. When I want to merge some of the elements, I will have to resort to something else, I guess.
  • This will fail to remove duplicate arrays.
  • That only partly helps.
  • @sawa, actually, I removed the uniq!, as it will affect the performance. I replaced != with !equal? instead. This will remove duplicates and perform better than !=.
  • With your previous solution using uniq!, you were able to remove exact duplicates but not the arrays that have the same elements in a different order. Your solution with equal? will fully work. But you should have tried harder to think by yourself before asking me.