
I've gotten into a bit of a jam and was wondering if someone could clear it up. What I want to do is:

  1. Open a bunch of .txt files containing data
  2. Create a multidimensional array that holds @array[@filenames][@data]
  3. Find which files are duplicates of each other in terms of data

Here I slurp a file into a variable, use a regex to extract my data, and put it into an array:

    while (my $row = <$fh>) {
        unless ($. == 0) {
            {
                local $/; # enable slurp for the rest of the file
                @datalist = <$fh> =~ /\s*\d*\/\s*\d*\|\s*(.*?)\|.*?(?:.*?\|){4}\s*(\S*)\|(\S*).*\|/g; # $1 = article number, $2 = quantity, $3 = unit
            }
            push(@arrayofarrays, [@datalist]);
            push(@filenames, $file);
            last;
        }
        $numr++;
    }
    open(my $feh, ">", "test.txt") or die "Can't write test.txt: $!";
    print {$feh} Dumper \@arrayofarrays;

A Dumper call shows that my data looks fine (pseudo-results, shortened to keep it easy to read):

$VAR1 = [
          [
            'data type1',
            'data type2',
            'data type3',
            'data type1',
            'data type2',
            'data type3',
            ...
          ],
          [
            'data type1',
            'data type2',
            'data type3',
            ...
          ],
        ...
     ];

So I'm wondering if anyone knows an easy way to check for duplicates between sets of data? I know I can print individual data sets; what I tried might give a better idea as to what I need to do:

my $i = 0;
my $j = 0;
while ($i <= scalar @arrayofarrays) {
    $j = 0;
    while ($j <= scalar @arrayofarrays) {
        if (@{$arrayofarrays[$i]} eq @{$arrayofarrays[$j]}) {
            print "\n'$filenames[$i]' is duplicate to '$filenames[$j]'.";
        }
        $j++;
    }
    $i++;
}
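For comparison, here is a corrected sketch of those loops with made-up sample data (the real @arrayofarrays and @filenames would come from the reading loop above). The inner loop starts at $i + 1 so a file is never compared with itself or reported twice, and the arrays are compared element-wise by joining them on a separator ("\0") that is assumed not to occur in the data:

```perl
use strict;
use warnings;

# Made-up sample data standing in for the real @filenames / @arrayofarrays.
my @filenames     = ('a.txt', 'b.txt', 'c.txt');
my @arrayofarrays = ([qw(x 5 pcs)], [qw(x 5 pcs)], [qw(y 2 kg)]);

my @dupes;
for my $i (0 .. $#arrayofarrays - 1) {
    for my $j ($i + 1 .. $#arrayofarrays) {   # $i + 1: skip self-pairs and repeats
        if (join("\0", @{$arrayofarrays[$i]}) eq join("\0", @{$arrayofarrays[$j]})) {
            push @dupes, "'$filenames[$i]' is a duplicate of '$filenames[$j]'";
        }
    }
}
print "$_\n" for @dupes;   # prints: 'a.txt' is a duplicate of 'b.txt'
```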
  • Perhaps you could edit your question to show us what output you are expecting. Your sample code is rather confused. You are comparing the number of elements in each of your second-level arrays, not any of the elements. And you're comparing those numbers using a string comparison (eq) rather than a numeric comparison (==). Commented Mar 31, 2017 at 13:40
  • If you want to check which files are identical then forget about reading them all into memory. Just use Digest::MD5 to create a checksum for each of them and compare the results. Commented Mar 31, 2017 at 16:21
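A minimal sketch of that Digest::MD5 suggestion, which applies when the files should be byte-for-byte identical (duplicate_groups is a made-up helper name, not from the question):

```perl
use strict;
use warnings;
use Digest::MD5;

# Group files by the MD5 digest of their raw bytes; files sharing a digest
# are (almost certainly) identical. Returns only groups with duplicates.
sub duplicate_groups {
    my @files = @_;
    my %by_digest;
    for my $file (@files) {
        open my $fh, '<', $file or die "Can't open '$file': $!";
        binmode $fh;
        push @{ $by_digest{ Digest::MD5->new->addfile($fh)->hexdigest } }, $file;
    }
    return grep { @$_ > 1 } values %by_digest;
}
```

Note this only catches files that are identical as a whole; if only the regex-extracted fields need to match, compare those instead.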

2 Answers


Instead of an array of arrays I'd create a hash of arrays, producing keys from the subarrays' data by flattening each subarray to a string, optionally turning the strings into checksums (appropriate for multidimensional subarrays). You may want to read this discussion on PerlMonks:

http://www.perlmonks.org/?node_id=1121378

Here is an abstract example, given an already existing array with duplicate data in its subarrays (you can test it on ideone.com):

#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
my @array = (
    [1,'John','ABXC12132328'],
    [0,'John','ABXC12132322'],
    [0,'John','ABXC12132322'],
    [0,'John','ABXC12132322'],
    [0,'John','ABXC12132322'],
    [0,'John','ABXC12132322'],
    [0,'John','ABXC12132322'],
    [0,'John','ABXC12132322'],
    [0,'John','ABXC12132322'],
    [0,'John','ABXC12132322'],
    [0,'John','ABXC12132322'],
    [0,'John','ABXC12132322'],
    [0,'John','ABXC12132322'],
    [0,'John','ABXC12132322'],
    [0,'John','ABXC12132322']
);
my %uniq_helper = ();
my @uniq_data = grep { !$uniq_helper{"@$_"}++ } @array;
print Dumper(\%uniq_helper) . "\n";
print Dumper(\@uniq_data) . "\n";

For your case it will probably look like this:

my %datalist;
while (my $row = <$fh>) {
    unless ($. == 0) {
        my @data;
        {
            local $/; # enable slurp
            @data = <$fh> =~ /\s*\d*\/\s*\d*\|\s*(.*?)\|.*?(?:.*?\|){4}\s*(\S*)\|(\S*).*\|/g; # $1 = article number, $2 = quantity, $3 = unit
        }
        $datalist{"@data"} = \@data;
        push(@filenames, $file);
        last;
    }
    $numr++;
}
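One caveat with the snippet above: keying %datalist by "@data" alone loses track of which file each data set came from. A sketch that keeps the association (duplicate_sets and its input layout are assumptions for illustration, not the answer's code):

```perl
use strict;
use warnings;

# Map each flattened data "fingerprint" to the files that produced it.
# Takes a hashref of filename => arrayref of extracted fields and returns
# only the groups of files whose extracted data is identical.
sub duplicate_sets {
    my ($data_for) = @_;
    my %files_for;
    for my $file (sort keys %$data_for) {
        push @{ $files_for{ join "\0", @{ $data_for->{$file} } } }, $file;
    }
    return grep { @$_ > 1 } values %files_for;
}

for my $set (duplicate_sets({ 'a.txt' => ['0815', '5', 'pcs'],
                              'b.txt' => ['0815', '5', 'pcs'],
                              'c.txt' => ['4711', '2', 'kg'] })) {
    print "Duplicates: @$set\n";   # prints: Duplicates: a.txt b.txt
}
```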

When you create @datalist, create a key for it and check for that key before you do the push, something like:

my %checkHash;
my $key = arrayKey(\@datalist);
if (!$checkHash{$key}) {
    push(@arrayofarrays, [@datalist]);
    push(@filenames, $file);
    $checkHash{$key} = 1;
    last;
}

sub arrayKey($) {
    my $arrayRef = shift;
    my $output = '';
    for (@$arrayRef) {
        if (ref($_) eq 'ARRAY') {
            $output .= "[";
            $output .= arrayKey($_);
            $output .= "]";
        }
        else {
            $output .= "$_,";
        }
    }
    return $output;
}
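For reference, the keys this helper produces look like this (a prototype-free copy of arrayKey is included so the snippet runs on its own):

```perl
use strict;
use warnings;

# Same logic as the answer's helper, without the prototype.
sub arrayKey {
    my $arrayRef = shift;
    my $output = '';
    for (@$arrayRef) {
        if (ref($_) eq 'ARRAY') {
            $output .= '[' . arrayKey($_) . ']';   # recurse into nested arrays
        }
        else {
            $output .= "$_,";
        }
    }
    return $output;
}

print arrayKey([1, 'John', 'ABXC12132322']), "\n";   # 1,John,ABXC12132322,
print arrayKey([1, [2, 3], 4]), "\n";                # 1,[2,3,]4,
```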

1 Comment

If the only parameter to arrayKey has to be an array reference, then surely \@ would be a better prototype than $. You could then call the subroutine as arrayKey(@datalist) and everything would still work. Mind you, like most people here, I think that prototypes are far more trouble than they're worth and I would never use them in examples aimed at beginners.
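A minimal sketch of what the commenter means by a \@ prototype (join_key is a made-up example sub, not the answer's code):

```perl
use strict;
use warnings;

# With a \@ prototype, Perl passes a reference to the named array
# automatically, so the caller writes join_key(@datalist) rather than
# join_key(\@datalist).
sub join_key (\@) {
    my $arrayRef = shift;          # receives \@datalist
    return join ',', @$arrayRef;
}

my @datalist = ('0815', '5', 'pcs');
print join_key(@datalist), "\n";   # 0815,5,pcs
```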
