
I'm currently facing this problem. For example, I have this array of hashes:

data = [
  {:id => 1,:start_date => "2015-01-02",:end_date => "2015-01-05"},
  {:id => 2,:start_date => "2015-01-06",:end_date => "2015-01-07"},
  {:id => 3,:start_date => "2015-01-10",:end_date => "2015-01-20"}
]

I want to find the exact hash whose start_date/end_date range contains "2015-01-04".

Following the documentation, I found three ways to do this:

1) Use select

finding_hash = data.select {|h| h[:start_date] <= "2015-01-04" && h[:end_date] >= "2015-01-04"}

finding_hash will be an array containing the matching hash. But since I can guarantee that only one hash will ever match the condition, after the select I have to call finding_hash.first to get the hash I want.
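That is, something like this extra step:

# select always returns an Array, so the single match has to be unwrapped:
finding_hash = data.select { |h| h[:start_date] <= "2015-01-04" && h[:end_date] >= "2015-01-04" }.first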

2) Use find

finding_hash = data.find{|h| h[:start_date] <= "2015-01-04" && h[:end_date] >= "2015-01-04"}

With this approach, finding_hash IS the hash I need.

3) Traditional loop

data.each do |t|
  if t[:start_date] <= "2015-01-04" && t[:end_date] >= "2015-01-04"
    return t  # note: `return` only works inside a method; the `break` that followed it was unreachable
  end
end

So which is the fastest way to do this? I really need the performance, because my data is quite big!

Thank you, and sorry for my bad English!

  • If your data is quite big then you should throw it in a database and index it. Even SQLite would probably eat up something like this. Commented May 13, 2015 at 3:44
  • Is it safe to assume that the hashes in the array are sorted by date? Commented May 13, 2015 at 4:39
  • @spickermann: no, the order is random, my friend. Commented May 13, 2015 at 6:04
  • So you have three pieces of code and you want to know which is fastest. Why don't you measure the performance? Commented May 13, 2015 at 10:26
  • @SergioTulentsev: I'm sorry, I don't know the proper way to do it, so I added two new variables, start_time and end_time, and at the end of each piece of code I printed end_time - start_time, but it didn't work out so well... Commented May 14, 2015 at 1:37
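(A minimal sketch of the kind of measurement @SergioTulentsev is suggesting, using Ruby's standard Benchmark module; the 10_000 iteration count is arbitrary:)

require 'benchmark'

# Benchmark.realtime returns the wall-clock seconds the block took
elapsed = Benchmark.realtime do
  10_000.times do
    data.find { |h| h[:start_date] <= "2015-01-04" && h[:end_date] >= "2015-01-04" }
  end
end
puts elapsed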

4 Answers


All the methods you've tried are Enumerable methods, but the native Array methods are faster. Try find_index. Even after having to make a separate call to fetch the hash, it's still about 20% faster than the next fastest:

index = data.find_index {|h| h[:start_date] <= "2015-01-04" && h[:end_date] >= "2015-01-04"}
x = data[index]
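Note that if nothing matches, find_index returns nil and data[index] would then raise a TypeError, so in the general case you may want a guard (this one-liner is my addition, not part of the original answer):

x = index && data[index]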

My benchmarks:

require 'benchmark'

n = 1_000_000

data = [
  {:id => 1,:start_date => "2015-01-02",:end_date => "2015-01-05"},
  {:id => 2,:start_date => "2015-01-06",:end_date => "2015-01-07"},
  {:id => 3,:start_date => "2015-01-10",:end_date => "2015-01-20"}
]

Benchmark.bm do |x|
  x.report 'Enumerable#select' do
    n.times do
      data.select do |h|
        h[:start_date] <= "2015-01-04" && h[:end_date] >= "2015-01-04"
      end
    end
  end

  x.report 'Enumerable#detect' do
    n.times do
      data.detect do |h|
        h[:start_date] <= "2015-01-04" && h[:end_date] >= "2015-01-04"
      end
    end
  end

  x.report 'Enumerable#each  ' do
    n.times do
      finding_hash = {}
      data.each do |t|
        if (t[:start_date] <= "2015-01-04" && t[:end_date] >= "2015-01-04")
          finding_hash = t
          break t
        end
      end
    end
  end

  x.report 'Array#find_index ' do
    n.times do
      index = data.find_index { |h| h[:start_date] <= "2015-01-04" && h[:end_date] >= "2015-01-04" }
      result = data[index]  # renamed from `x`, which would shadow the benchmark reporter
    end
  end
end

Results are:

                       user     system      total        real
Enumerable#select  1.000000   0.010000   1.010000 (  1.002282)
Enumerable#detect  0.790000   0.000000   0.790000 (  0.797319)
Enumerable#each    0.620000   0.000000   0.620000 (  0.627272)
Array#find_index   0.520000   0.000000   0.520000 (  0.515691)



You can test with a benchmark.

For example:

require 'benchmark'

n = 1000000

data = [
  {:id => 1,:start_date => "2015-01-02",:end_date => "2015-01-05"},
  {:id => 2,:start_date => "2015-01-06",:end_date => "2015-01-07"},
  {:id => 3,:start_date => "2015-01-10",:end_date => "2015-01-20"}
]


Benchmark.bm do |x|
  x.report do
    n.times do
      data.select { |h| h[:start_date] <= "2015-01-04" && h[:end_date] >= "2015-01-04" }
    end
  end

  x.report do
    n.times do
      data.find { |h| h[:start_date] <= "2015-01-04" && h[:end_date] >= "2015-01-04" }
    end
  end

  x.report do
    n.times do
      finding_hash = {}
      data.each do |t|
        if t[:start_date] <= "2015-01-04" && t[:end_date] >= "2015-01-04"
          finding_hash = t
          break
        end
      end
    end
  end
end

output:

       user     system      total        real
   1.490000   0.020000   1.510000 (  1.533589)
   1.070000   0.010000   1.080000 (  1.096578)
   1.000000   0.010000   1.010000 (  1.011021)

The results depend on the value of n and the size of the data.

2 Comments

@DuongBach: don't post "thank you" comments. Upvote is the best thanks (if you truly think it's helpful)
@Sergio Tulentsev, sorry, I found the mistake, I will correct it.

v3 is fastest:

def v1
  @data.select {|h| h[:start_date] <= "2015-01-04" && h[:end_date] >= "2015-01-04"}
end

def v2
  @data.find{|h| h[:start_date] <= "2015-01-04" && h[:end_date] >= "2015-01-04"}
end

def v3
  @data.each do |t|
    if t[:start_date] <= "2015-01-04" && t[:end_date] >= "2015-01-04"
      return t  # the `break` that followed this `return` was unreachable and has been removed
    end
  end
end

select will always be slowest because it has to iterate through the entire array. I'm not sure why find is slower than v3. It may have to do with overhead.

However, find and v3 may perform the same on your data; the results below are not necessarily valid for it.

t = Time.now; 10000.times{ v1 }; Time.now - t
=> 0.014131

t = Time.now; 10000.times{ v2 }; Time.now - t
=> 0.013138

t = Time.now; 10000.times{ v3 }; Time.now - t
=> 0.008799

Running this on the sample data is not the same thing as running it on your real data.

If the real data is too large, you can run it on a subset of the data to get a better answer.

BTW, you can rewrite v3 as:

data.each do |t|
  break t if t[:start_date] <= "2015-01-04" && t[:end_date] >= "2015-01-04"
end
# Caveat: if nothing matches, `each` returns the whole array rather than nil.

FWIW, operating on an array is going to be very unwieldy and slow. You may want to save it in a database and run a query. For a large dataset, that would probably be at least 2 orders of magnitude faster.
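For illustration, here is a minimal sketch of that approach, assuming the sqlite3 gem; the table and index names are hypothetical:

require 'sqlite3'  # assumes the sqlite3 gem is installed

db = SQLite3::Database.new(":memory:")
db.execute("CREATE TABLE ranges (id INTEGER PRIMARY KEY, start_date TEXT, end_date TEXT)")
# An index on the date columns lets the query avoid a full table scan:
db.execute("CREATE INDEX idx_ranges_dates ON ranges (start_date, end_date)")

data.each do |h|
  db.execute("INSERT INTO ranges (id, start_date, end_date) VALUES (?, ?, ?)",
             [h[:id], h[:start_date], h[:end_date]])
end

# Returns the first matching row as an array, e.g. [1, "2015-01-02", "2015-01-05"]
row = db.execute("SELECT id, start_date, end_date FROM ranges " \
                 "WHERE start_date <= ? AND end_date >= ? LIMIT 1",
                 ["2015-01-04", "2015-01-04"]).first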

Comments

1

All of those variants are O(n). If your ranges don't overlap, you can use Array#bsearch, which is O(log n). You need to sort your ranges first.

sorted = data.sort_by { |x| x[:start_date] }
# Find-any mode: return a negative number to search left, positive to search right, 0 when found
sorted.bsearch { |x| "2015-01-04" < x[:start_date] ? -1 : ("2015-01-04" > x[:end_date] ? 1 : 0) }
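Spelled out with an explicit target (the variable names here are just illustrative):

target = "2015-01-04"
found = sorted.bsearch do |x|
  if target < x[:start_date]
    -1  # target lies before this range, so search the left half
  elsif target > x[:end_date]
    1   # target lies after this range, so search the right half
  else
    0   # target falls within this range
  end
end
found # => {:id=>1, :start_date=>"2015-01-02", :end_date=>"2015-01-05"}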

