
I'm currently facing this problem. For example, I have this array of hashes:

data = [
  {:id => 1,:start_date => "2015-01-02",:end_date => "2015-01-05"},
  {:id => 2,:start_date => "2015-01-06",:end_date => "2015-01-07"},
  {:id => 3,:start_date => "2015-01-10",:end_date => "2015-01-20"}
]

I want to find the exact hash whose start_date/end_date range contains "2015-01-04".

Following the documentation, I found three ways to do this:

1) Use select

finding_hash = data.select {|h| h[:start_date] <= "2015-01-04" && h[:end_date] >= "2015-01-04"}

finding_hash will be an array containing the matching hash. But since I can guarantee that only one hash will ever match the condition, after the select I have to call finding_hash.first to get the hash I want.
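That is, something like this extra step:

# select always returns an Array, so the single match has to be unwrapped:
finding_hash = data.select { |h| h[:start_date] <= "2015-01-04" && h[:end_date] >= "2015-01-04" }.first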

2) Use find

finding_hash = data.find{|h| h[:start_date] <= "2015-01-04" && h[:end_date] >= "2015-01-04"}

With this approach, finding_hash IS the hash I need.

3) Traditional loop

data.each do |t|
  if t[:start_date] <= "2015-01-04" && t[:end_date] >= "2015-01-04"
    return t  # note: `return` only works inside a method; the `break` that followed it was unreachable
  end
end

So which is the fastest way to do this? I really need the performance, because my data is quite big!

Thank you, and sorry for my bad English!

  • If your data is quite big then you should throw it in a database and index it. Even SQLite would probably eat up something like this. Commented May 13, 2015 at 3:44
  • Is it safe to assume that the hashes in the array are sorted by date? Commented May 13, 2015 at 4:39
  • @spickermann: no, the order is random, my friend. Commented May 13, 2015 at 6:04
  • So you have three pieces of code and you want to know which is fastest. Why don't you measure the performance? Commented May 13, 2015 at 10:26
  • @SergioTulentsev: I'm sorry, I don't know the proper way to do it, so I added two new variables, start_time and end_time, and at the end of each piece of code I printed end_time - start_time, but it didn't work out so well... Commented May 14, 2015 at 1:37
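(A minimal sketch of the kind of measurement @SergioTulentsev is suggesting, using Ruby's standard Benchmark module; the 10_000 iteration count is arbitrary:)

require 'benchmark'

# Benchmark.realtime returns the wall-clock seconds the block took
elapsed = Benchmark.realtime do
  10_000.times do
    data.find { |h| h[:start_date] <= "2015-01-04" && h[:end_date] >= "2015-01-04" }
  end
end
puts elapsed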

4 Answers


All the methods you've tried are Enumerable methods, but the native Array methods are faster. Try find_index. Even after having to make a separate call to fetch the hash, it's still about 20% faster than the next fastest:

index = data.find_index {|h| h[:start_date] <= "2015-01-04" && h[:end_date] >= "2015-01-04"}
x = data[index]
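Note that if nothing matches, find_index returns nil and data[index] would then raise a TypeError, so in the general case you may want a guard (this one-liner is my addition, not part of the original answer):

x = index && data[index]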

My benchmarks:

require 'benchmark'

n = 1_000_000

data = [
  {:id => 1,:start_date => "2015-01-02",:end_date => "2015-01-05"},
  {:id => 2,:start_date => "2015-01-06",:end_date => "2015-01-07"},
  {:id => 3,:start_date => "2015-01-10",:end_date => "2015-01-20"}
]

Benchmark.bm do |x|
  x.report 'Enumerable#select' do
    n.times do
      data.select do |h|
        h[:start_date] <= "2015-01-04" && h[:end_date] >= "2015-01-04"
      end
    end
  end

  x.report 'Enumerable#detect' do
    n.times do
      data.detect do |h|
        h[:start_date] <= "2015-01-04" && h[:end_date] >= "2015-01-04"
      end
    end
  end

  x.report 'Enumerable#each  ' do
    n.times do
      finding_hash = {}
      data.each do |t|
        if (t[:start_date] <= "2015-01-04" && t[:end_date] >= "2015-01-04")
          finding_hash = t
          break t
        end
      end
    end
  end

  x.report 'Array#find_index ' do
    n.times do
      index = data.find_index { |h| h[:start_date] <= "2015-01-04" && h[:end_date] >= "2015-01-04" }
      result = data[index]  # renamed from `x`, which would shadow the benchmark reporter
    end
  end
end

Results are:

                       user     system      total        real
Enumerable#select  1.000000   0.010000   1.010000 (  1.002282)
Enumerable#detect  0.790000   0.000000   0.790000 (  0.797319)
Enumerable#each    0.620000   0.000000   0.620000 (  0.627272)
Array#find_index   0.520000   0.000000   0.520000 (  0.515691)



You can test with a benchmark.

For example:

require 'benchmark'

n = 1000000

data = [
  {:id => 1,:start_date => "2015-01-02",:end_date => "2015-01-05"},
  {:id => 2,:start_date => "2015-01-06",:end_date => "2015-01-07"},
  {:id => 3,:start_date => "2015-01-10",:end_date => "2015-01-20"}
]


Benchmark.bm do |x|
  x.report do
    n.times do
      data.select { |h| h[:start_date] <= "2015-01-04" && h[:end_date] >= "2015-01-04" }
    end
  end

  x.report do
    n.times do
      data.find { |h| h[:start_date] <= "2015-01-04" && h[:end_date] >= "2015-01-04" }
    end
  end

  x.report do
    n.times do
      finding_hash = {}
      data.each do |t|
        if t[:start_date] <= "2015-01-04" && t[:end_date] >= "2015-01-04"
          finding_hash = t
          break
        end
      end
    end
  end
end

output:

       user     system      total        real
   1.490000   0.020000   1.510000 (  1.533589)
   1.070000   0.010000   1.080000 (  1.096578)
   1.000000   0.010000   1.010000 (  1.011021)

The results depend on the value of n and the size of the data.

2 Comments

@DuongBach: don't post "thank you" comments. Upvote is the best thanks (if you truly think it's helpful)
@Sergio Tulentsev, sorry, I found the mistake, I will correct it.

v3 is fastest:

def v1
  @data.select {|h| h[:start_date] <= "2015-01-04" && h[:end_date] >= "2015-01-04"}
end

def v2
  @data.find{|h| h[:start_date] <= "2015-01-04" && h[:end_date] >= "2015-01-04"}
end

def v3
  @data.each do |t|
    if t[:start_date] <= "2015-01-04" && t[:end_date] >= "2015-01-04"
      return t  # the `break` that followed this `return` was unreachable and has been removed
    end
  end
end

select will always be slowest because it has to iterate through the entire array. I'm not sure why find is slower than v3. It may have to do with overhead.

However, find and v3 may perform the same on your data; the results below are not necessarily valid for it.

t = Time.now; 10000.times{ v1 }; Time.now - t
=> 0.014131

t = Time.now; 10000.times{ v2 }; Time.now - t
=> 0.013138

t = Time.now; 10000.times{ v3 }; Time.now - t
=> 0.008799

Running this on the sample data is not the same thing as running it on your real data.

If the real data is too large, you can run it on a subset of the data to get a better answer.

BTW, you can rewrite v3 as:

data.each do |t|
  break t if t[:start_date] <= "2015-01-04" && t[:end_date] >= "2015-01-04"
end
# Caveat: if nothing matches, `each` returns the whole array rather than nil.

FWIW, operating on an array is going to be very unwieldy and slow. You may want to save it in a database and run a query. For a large dataset, that would probably be at least 2 orders of magnitude faster.
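For illustration, here is a minimal sketch of that approach, assuming the sqlite3 gem; the table and index names are hypothetical:

require 'sqlite3'  # assumes the sqlite3 gem is installed

db = SQLite3::Database.new(":memory:")
db.execute("CREATE TABLE ranges (id INTEGER PRIMARY KEY, start_date TEXT, end_date TEXT)")
# An index on the date columns lets the query avoid a full table scan:
db.execute("CREATE INDEX idx_ranges_dates ON ranges (start_date, end_date)")

data.each do |h|
  db.execute("INSERT INTO ranges (id, start_date, end_date) VALUES (?, ?, ?)",
             [h[:id], h[:start_date], h[:end_date]])
end

# Returns the first matching row as an array, e.g. [1, "2015-01-02", "2015-01-05"]
row = db.execute("SELECT id, start_date, end_date FROM ranges " \
                 "WHERE start_date <= ? AND end_date >= ? LIMIT 1",
                 ["2015-01-04", "2015-01-04"]).first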

Comments

1

All of those variants are O(n). If your ranges don't overlap, you can use Array#bsearch, which is O(log n). You need to sort your ranges first.

sorted = data.sort_by { |x| x[:start_date] }
# Find-any mode: return a negative number to search left, positive to search right, 0 when found
sorted.bsearch { |x| "2015-01-04" < x[:start_date] ? -1 : ("2015-01-04" > x[:end_date] ? 1 : 0) }
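Spelled out with an explicit target (the variable names here are just illustrative):

target = "2015-01-04"
found = sorted.bsearch do |x|
  if target < x[:start_date]
    -1  # target lies before this range, so search the left half
  elsif target > x[:end_date]
    1   # target lies after this range, so search the right half
  else
    0   # target falls within this range
  end
end
found # => {:id=>1, :start_date=>"2015-01-02", :end_date=>"2015-01-05"}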

