I have files with 1e8 numbers in them, ranging in value from 1 to 999. I need to read through each file and save a report of how many of each number were found in each file. I thought that setting up a constant array of all zeros and then incrementing using the number I just read as the index would give me the answer. Perl syntax for doing this is not what I expected. Not all numbers will necessarily occur. Maybe hashes are the way to go, but the array will likely only have a few holes in it. Any help? Thanks.
3 Answers
Perl syntax for doing this is not what I expected.
Show your work
Not all numbers will necessarily occur.
Skip the zeros (a simple if block; see http://perldoc.perl.org/perlintro.html)
Maybe hashes are the way to go, but the array will likely only have a few holes in it.
Yes, hashes are a natural fit; see perlfaq4 and search for "count", "uniq", and "duplicate"
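For instance, counting per file with a hash might look like this (a minimal sketch; the one-number-per-line input format and the "$file.report" output name are assumptions, not anything the OP specified):

use strict;
use warnings;

# One report per input file, using perlfaq4's counting idiom.
for my $file (@ARGV) {
    open my $in, '<', $file or die "Can't open $file: $!";
    my %count;
    while ( my $number = <$in> ) {
        chomp $number;
        $count{$number}++;    # the number itself is the hash key
    }
    close $in;

    open my $out, '>', "$file.report" or die "Can't write $file.report: $!";
    # Only numbers that actually occurred have keys, so zero counts
    # never need to be skipped explicitly.
    print {$out} "$_: $count{$_}\n" for sort { $a <=> $b } keys %count;
    close $out;
}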
Comments
Tom Duff (creator of Duff's Device): "If your code is too slow, you must make it faster. If no better algorithm is available, you must trim cycles."
I disagree that this is a situation where a hash is the best fit. Sure, a hash is the idiomatic fit: it's the method mentioned in perlfaq4, and it doesn't waste elements in your counter container. But the OP is talking about 100_000_000 integers between 1 and 999. The number of elements used in a counter container is insignificant; the number of iterations necessary to get a count is very significant. 100,000,000 iterations take a lot of time.
If we use an array instead, we throw away the element indexed at zero, and if all of the integers happen to be the same value, we throw away 998 more elements. Is that such a big deal? On the other hand, although indexing into an array and indexing into a hash are both O(1) operations, Big-O notation only tells part of the story. Where n is the total number of integers (100,000,000), both the array approach and the hash approach are O(n), so it comes down to which approach is more computationally efficient per operation. Even though an array lookup and a hash lookup are each O(1), it turns out the hash lookup requires significantly more cycles to perform: it has to compute a hash of the key and locate the right bucket, where the array lookup is little more than pointer arithmetic.
Iterating over 100,000,000 integers and incrementing counters takes time. But it turns out that it takes more time doing so inside of a hash than it does in an array. I know this is sacrilege from a "common idioms" standpoint. But this may be a very special case where computational efficiency is more important than idiomatic code, and more important than the very slightly larger memory footprint of using an array as a counter.
Here is some code demonstrating what I'm talking about:
use strict;
use warnings;
use Benchmark qw/ cmpthese /;
use List::Util qw/ max min /;
use Test::More;
use Readonly;

Readonly my $datasize  => 100_000_000;
Readonly my $test_size => 100_000;

# Sanity-check both counters on a small sample before benchmarking.
my @array_results = keys %{ count_in_array( $test_size ) };
my @hash_results  = keys %{ count_in_hash( $test_size ) };

is( max( @array_results ), 999, "max( keys count_in_array() ) in range." );
is( min( @array_results ), 1,   "min( keys count_in_array() ) in range." );
is( max( @hash_results ),  999, "max( keys count_in_hash() ) in range."  );
is( min( @hash_results ),  1,   "min( keys count_in_hash() ) in range."  );
done_testing();

cmpthese( 5, {
    array => sub { my $results = count_in_array() },
    hash  => sub { my $results = count_in_hash()  },
} );

sub count_in_array {
    my $size = shift // $datasize;
    my @container;
    for ( 1 .. $size ) {
        $container[ int( rand( 999 ) ) + 1 ]++;
    }
    # Convert to a hashref, dropping the unused slots.
    return {
        map  { $_ => $container[$_] }
        grep { defined $container[$_] }
        1 .. $#container
    };
}

sub count_in_hash {
    my $size = shift // $datasize;
    my %container;
    for ( 1 .. $size ) {
        $container{ int( rand( 999 ) ) + 1 }++;
    }
    return \%container;
}
And here are the results of that benchmark:
ok 1 - max( keys count_in_array() ) in range.
ok 2 - min( keys count_in_array() ) in range.
ok 3 - max( keys count_in_hash() ) in range.
ok 4 - min( keys count_in_hash() ) in range.
1..4
       s/iter  hash array
hash     24.9    --  -42%
array    14.5   72%    --
That's a big win for the array approach (72% faster, by Benchmark's comparison). If that's still too slow, trim cycles by using Inline::C to rewrite the array approach with a C array of ints. That should be an order of magnitude faster still.
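For what it's worth, the Inline::C version might look something like this (a rough sketch, not a drop-in replacement for the benchmark above: count_in_c is a made-up name, and it uses C's rand() rather than Perl's, so the two random streams differ):

use strict;
use warnings;
use Inline C => <<'END_C';

/* Tally n random integers in 1..999 in a plain C int array,
   then return the nonzero counts as a Perl hash reference. */
SV* count_in_c ( int n ) {
    int  counts[1000] = { 0 };
    char key[4];
    int  i;
    HV*  hv = newHV();
    for ( i = 0; i < n; i++ )
        counts[ ( rand() % 999 ) + 1 ]++;
    for ( i = 1; i <= 999; i++ ) {
        if ( counts[i] > 0 ) {
            int klen = sprintf( key, "%d", i );
            hv_store( hv, key, klen, newSViv( counts[i] ), 0 );
        }
    }
    return newRV_noinc( (SV*) hv );
}
END_C

my $counts = count_in_c( 100_000_000 );
print scalar keys %$counts, " distinct values counted\n";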
This could be that critical 3% (in Knuth's famous formulation) where optimization is warranted.
So what do we do to mitigate the effects of breaking away from a commonly recognized idiom? We make sure to document what we're doing so that future readers (including ourselves in the future) understand what is being done, and why it's being done in a way that isn't as familiar to us.
Just my .02.
$count{$number}++
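In context, that might look like (a minimal sketch; one number per line is an assumption about the input):

my %count;
while ( my $number = <> ) {
    chomp $number;
    $count{$number}++;
}
print "$_: $count{$_}\n" for sort { $a <=> $b } keys %count;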