
I am trying to analyze some documents and find similarities in them. After the analysis, I have an array whose elements are arrays of data from documents considered similar. But sometimes I have two nearly identical elements, and naturally I want to keep only the larger of them. For simplification:

data = [[1,2,3,4,5,6], [7,8,9,10], [1,2,3,5,6]...]

How do I efficiently process the data so that I get:

data = [[1,2,3,4,5,6], [7,8,9,10]...]

I suppose I could intersect every pair of arrays, and if the intersection matches one of the original arrays, ignore that array. Here is some quick code I wrote:

data = [[1,2,3,4,5,6], [7,8,9,10], [1,2,3,5,6], [7,9,10]]
cleaned = []

data.each_index do |i|
  similar = false
  data.each_index do |j|
    if i == j
      next
    elsif (data[i] & data[j]) == data[i]
      # data[i] is fully contained in data[j], so it can be dropped
      similar = true
      break
    end
  end
  cleaned << data[i] unless similar
end

puts cleaned.inspect
# prints [[1, 2, 3, 4, 5, 6], [7, 8, 9, 10]]

Is this an efficient way to go? Also, the current behaviour only removes arrays that are a few elements short of another; I might also want to merge similar arrays when they occur:

[[1,2,3,4,5], [1,3,4,5,6]] => [[1,2,3,4,5,6]]
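
Something along these lines is roughly what I have in mind for the merge case (an untested sketch; the 0.8 overlap ratio and the overlap_ratio helper are placeholders I made up, not settled choices):

# Merge an array into an existing result group when their overlap ratio
# reaches the threshold; otherwise keep it as its own group.
MERGE_THRESHOLD = 0.8

def overlap_ratio(a, b)
  (a & b).size.to_f / [a.size, b.size].min
end

def merge_similar(arrays, threshold = MERGE_THRESHOLD)
  result = []
  arrays.each do |arr|
    existing = result.find { |r| overlap_ratio(r, arr) >= threshold }
    if existing
      existing.replace(existing | arr)   # union keeps each element once
    else
      result << arr.dup
    end
  end
  result
end

merge_similar([[1,2,3,4,5], [1,3,4,5,6]])
# => [[1, 2, 3, 4, 5, 6]]
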
4 Comments
  • The last part of your question requires you to define "similar enough to merge" - for example, how many elements (or what ratio of them) need to match before you want to merge? Commented Jun 18, 2014 at 10:17
  • @NeilSlater, I haven't thought about it yet and haven't analysed much data to see how often this happens. From one batch of documents I noticed two similar arrays of data, which differed by one element ([same_stuff, x], [same_stuff, y]). I suppose the difference might be slightly bigger. Commented Jun 18, 2014 at 10:57
  • -1 The question is not well defined. Commented Jun 18, 2014 at 11:44
  • @sawa what exactly is not well defined? You can ask for clarification if I explained something that is hard to understand. Commented Jun 18, 2014 at 12:19

1 Answer


You can delete any element in the list if it is fully contained in another element:

data.delete_if do |arr|
  data.any? { |a2| !a2.equal?(arr) && arr - a2 == [] }
end
# => [[1, 2, 3, 4, 5, 6], [7, 8, 9, 10]]

This is a bit more efficient than your suggestion, since once you decide that an element should be removed, you don't check against it in later iterations.
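
If the inner arrays get large, a possible variation (just an untested sketch, not part of the answer above) is to precompute a Set for each array so each containment check becomes set lookups instead of rebuilding array differences every time:

require 'set'

# One Set per array, keyed by object identity so value-equal arrays
# still get their own entry.
sets = {}.compare_by_identity
data.each { |arr| sets[arr] = Set.new(arr) }

data.delete_if do |arr|
  data.any? { |a2| !a2.equal?(arr) && sets[arr].subset?(sets[a2]) }
end
# => [[1, 2, 3, 4, 5, 6], [7, 8, 9, 10]]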


6 Comments

  • Great, this is indeed more efficient. When I want to merge some of the elements, I will have to resort to something else, I guess.
  • This will fail to remove duplicate arrays.
  • That only partly helps.
  • @sawa, actually, I removed the uniq!, as it will affect the performance. I replaced != with !equal? instead. This will remove duplicates and perform better than !=.
  • With your previous solution using uniq!, you were able to remove exact duplicates but not the arrays that have the same elements in a different order. Your solution with equal? will fully work. But you should have tried harder to think by yourself before asking me.