I have files with 1e8 numbers in them, ranging in value from 1 to 999. I need to read through each file and save a report of how many of each number were found in each file. I thought that setting up a constant array of all zeros and then incrementing using the number I just read as the index would give me the answer. Perl syntax for doing this is not what I expected. Not all numbers will necessarily occur. Maybe hashes are the way to go, but the array will likely only have a few holes in it. Any help? Thanks.
3 Answers
Perl syntax for doing this is not what I expected.
Show your work
Not all numbers will necessarily occur.
Skip the zeros (a simple if block; see http://perldoc.perl.org/perlintro.html)
Maybe hashes are the way to go, but the array will likely only have a few holes in it.
Yes, hashes are a natural fit; see perlfaq4 and search for "count", "uniq", and "duplicate"
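For instance, counting per file with a hash might look like this (a minimal sketch; the one-number-per-line input format and the "$file.report" output name are assumptions, not anything the OP specified):

use strict;
use warnings;

# One report per input file, using perlfaq4's counting idiom.
for my $file (@ARGV) {
    open my $in, '<', $file or die "Can't open $file: $!";
    my %count;
    while ( my $number = <$in> ) {
        chomp $number;
        $count{$number}++;    # the number itself is the hash key
    }
    close $in;

    open my $out, '>', "$file.report" or die "Can't write $file.report: $!";
    # Only numbers that actually occurred have keys, so zero counts
    # never need to be skipped explicitly.
    print {$out} "$_: $count{$_}\n" for sort { $a <=> $b } keys %count;
    close $out;
}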
Comments
Tom Duff (creator of Duff's Device): "If your code is too slow, you must make it faster. If no better algorithm is available, you must trim cycles."
I disagree that this is a situation where a hash is the best fit. Sure, a hash is the idiomatic fit: it's the method mentioned in perlfaq4, and it doesn't waste elements in your counter container. But the OP is talking about 100_000_000 integers between 1 and 999. The number of elements used in a counter container is insignificant; the number of iterations necessary to get a count is very significant. 100,000,000 iterations take a lot of time.
If we use an array instead, we throw away the element indexed at zero, and if all of the integers happen to be the same value, we throw away 998 more elements. Is that such a big deal? On the other hand, although indexing into an array and indexing into a hash are both O(1) operations, Big-O notation only tells part of the story. Where n is the total number of integers (100,000,000), both the array approach and the hash approach are O(n), so it comes down to which approach is more computationally efficient per operation. Even though an array lookup and a hash lookup are each O(1), it turns out the hash lookup requires significantly more cycles to perform: it has to compute a hash of the key and locate the right bucket, where the array lookup is little more than pointer arithmetic.
Iterating over 100,000,000 integers and incrementing counters takes time. But it turns out that it takes more time doing so inside of a hash than it does in an array. I know this is sacrilege from a "common idioms" standpoint. But this may be a very special case where computational efficiency is more important than idiomatic code, and more important than the very slightly larger memory footprint of using an array as a counter.
Here is some code demonstrating what I'm talking about:
use strict;
use warnings;
use Benchmark qw/ cmpthese /;
use List::Util qw/ max min /;
use Test::More;
use Readonly;

Readonly my $datasize  => 100_000_000;
Readonly my $test_size => 100_000;

# Sanity-check both counters on a small sample before benchmarking.
my @array_results = keys %{ count_in_array( $test_size ) };
my @hash_results  = keys %{ count_in_hash( $test_size ) };

is( max( @array_results ), 999, "max( keys count_in_array() ) in range." );
is( min( @array_results ), 1,   "min( keys count_in_array() ) in range." );
is( max( @hash_results ),  999, "max( keys count_in_hash() ) in range."  );
is( min( @hash_results ),  1,   "min( keys count_in_hash() ) in range."  );
done_testing();

cmpthese( 5, {
    array => sub { my $results = count_in_array() },
    hash  => sub { my $results = count_in_hash()  },
} );

sub count_in_array {
    my $size = shift // $datasize;
    my @container;
    for ( 1 .. $size ) {
        $container[ int( rand( 999 ) ) + 1 ]++;
    }
    # Convert to a hashref, dropping the unused slots.
    return {
        map  { $_ => $container[$_] }
        grep { defined $container[$_] }
        1 .. $#container
    };
}

sub count_in_hash {
    my $size = shift // $datasize;
    my %container;
    for ( 1 .. $size ) {
        $container{ int( rand( 999 ) ) + 1 }++;
    }
    return \%container;
}
And here are the results of that benchmark:
ok 1 - max( keys count_in_array() ) in range.
ok 2 - min( keys count_in_array() ) in range.
ok 3 - max( keys count_in_hash() ) in range.
ok 4 - min( keys count_in_hash() ) in range.
1..4
       s/iter  hash array
hash     24.9    --  -42%
array    14.5   72%    --
That's a big win for the array approach (72% faster, by Benchmark's comparison). If that's still too slow, trim cycles by using Inline::C to rewrite the array approach with a C array of ints. That should be an order of magnitude faster still.
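For what it's worth, the Inline::C version might look something like this (a rough sketch, not a drop-in replacement for the benchmark above: count_in_c is a made-up name, and it uses C's rand() rather than Perl's, so the two random streams differ):

use strict;
use warnings;
use Inline C => <<'END_C';

/* Tally n random integers in 1..999 in a plain C int array,
   then return the nonzero counts as a Perl hash reference. */
SV* count_in_c ( int n ) {
    int  counts[1000] = { 0 };
    char key[4];
    int  i;
    HV*  hv = newHV();
    for ( i = 0; i < n; i++ )
        counts[ ( rand() % 999 ) + 1 ]++;
    for ( i = 1; i <= 999; i++ ) {
        if ( counts[i] > 0 ) {
            int klen = sprintf( key, "%d", i );
            hv_store( hv, key, klen, newSViv( counts[i] ), 0 );
        }
    }
    return newRV_noinc( (SV*) hv );
}
END_C

my $counts = count_in_c( 100_000_000 );
print scalar keys %$counts, " distinct values counted\n";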
This could be that critical 3% (in Knuth's famous formulation) where optimization is warranted.
So what do we do to mitigate the effects of breaking away from a commonly recognized idiom? We make sure to document what we're doing so that future readers (including ourselves in the future) understand what is being done, and why it's being done in a way that isn't as familiar to us.
Just my .02.
$count{$number}++
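In context, that might look like (a minimal sketch; one number per line is an assumption about the input):

my %count;
while ( my $number = <> ) {
    chomp $number;
    $count{$number}++;
}
print "$_: $count{$_}\n" for sort { $a <=> $b } keys %count;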