
I've gotten into a bit of a jam and was wondering if someone could clear it up. What I want to do is:

  1. Open a bunch of .txt files containing data
  2. Create a multidimensional array that holds @array[@filenames][@data]
  3. Find which files are duplicates of each other in terms of data

Here I slurp a file into a variable, use a regex to extract my data, and put it into an array:

    while (my $row = <$fh>) {
        unless ($. == 0) {
            {
                local $/; # enable slurp for the rest of the file
                @datalist = <$fh> =~ /\s*\d*\/\s*\d*\|\s*(.*?)\|.*?(?:.*?\|){4}\s*(\S*)\|(\S*).*\|/g; # $1 = article number, $2 = quantity, $3 = unit
            }
            push(@arrayofarrays, [@datalist]);
            push(@filenames, $file);
            last;
        }
        $numr++;
    }
    open(my $feh, ">", "test.txt") or die "Can't write test.txt: $!";
    print {$feh} Dumper \@arrayofarrays;

A Dumper call shows that my data looks fine (pseudo-results, shortened to keep it easy to read):

$VAR1 = [
          [
            'data type1',
            'data type2',
            'data type3',
            'data type1',
            'data type2',
            'data type3',
            ...
          ],
          [
            'data type1',
            'data type2',
            'data type3',
            ...
          ],
        ...
     ];

So I'm wondering if anyone knows an easy way to check for duplicates between sets of data? I know I can print individual data sets; what I tried might give a better idea as to what I need to do:

my $i = 0;
my $j = 0;
while ($i <= scalar @arrayofarrays) {
    $j = 0;
    while ($j <= scalar @arrayofarrays) {
        if (@{$arrayofarrays[$i]} eq @{$arrayofarrays[$j]}) {
            print "\n'$filenames[$i]' is duplicate to '$filenames[$j]'.";
        }
        $j++;
    }
    $i++;
}
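For comparison, here is a corrected sketch of those loops with made-up sample data (the real @arrayofarrays and @filenames would come from the reading loop above). The inner loop starts at $i + 1 so a file is never compared with itself or reported twice, and the arrays are compared element-wise by joining them on a separator ("\0") that is assumed not to occur in the data:

```perl
use strict;
use warnings;

# Made-up sample data standing in for the real @filenames / @arrayofarrays.
my @filenames     = ('a.txt', 'b.txt', 'c.txt');
my @arrayofarrays = ([qw(x 5 pcs)], [qw(x 5 pcs)], [qw(y 2 kg)]);

my @dupes;
for my $i (0 .. $#arrayofarrays - 1) {
    for my $j ($i + 1 .. $#arrayofarrays) {   # $i + 1: skip self-pairs and repeats
        if (join("\0", @{$arrayofarrays[$i]}) eq join("\0", @{$arrayofarrays[$j]})) {
            push @dupes, "'$filenames[$i]' is a duplicate of '$filenames[$j]'";
        }
    }
}
print "$_\n" for @dupes;   # prints: 'a.txt' is a duplicate of 'b.txt'
```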
  • Perhaps you could edit your question to show us what output you are expecting. Your sample code is rather confused. You are comparing the number of elements in each of your second-level arrays, not any of the elements. And you're comparing those numbers using a string comparison (eq) rather than a numeric comparison (==). Commented Mar 31, 2017 at 13:40
  • If you want to check which files are identical then forget about reading them all into memory. Just use Digest::MD5 to create a checksum for each of them and compare the results. Commented Mar 31, 2017 at 16:21
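A minimal sketch of that Digest::MD5 suggestion, which applies when the files should be byte-for-byte identical (duplicate_groups is a made-up helper name, not from the question):

```perl
use strict;
use warnings;
use Digest::MD5;

# Group files by the MD5 digest of their raw bytes; files sharing a digest
# are (almost certainly) identical. Returns only groups with duplicates.
sub duplicate_groups {
    my @files = @_;
    my %by_digest;
    for my $file (@files) {
        open my $fh, '<', $file or die "Can't open '$file': $!";
        binmode $fh;
        push @{ $by_digest{ Digest::MD5->new->addfile($fh)->hexdigest } }, $file;
    }
    return grep { @$_ > 1 } values %by_digest;
}
```

Note this only catches files that are identical as a whole; if only the regex-extracted fields need to match, compare those instead.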

2 Answers


Instead of an array of arrays I'd create a hash of arrays, producing keys from the subarrays' data by flattening each subarray to a string, optionally turning the strings into checksums (appropriate for multidimensional subarrays). You may want to read this discussion on PerlMonks:

http://www.perlmonks.org/?node_id=1121378

Here is an abstract example, given an already existing array with duplicate data in its subarrays (you can test it on ideone.com):

#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
my @array = (
    [1,'John','ABXC12132328'],
    [0,'John','ABXC12132322'],
    [0,'John','ABXC12132322'],
    [0,'John','ABXC12132322'],
    [0,'John','ABXC12132322'],
    [0,'John','ABXC12132322'],
    [0,'John','ABXC12132322'],
    [0,'John','ABXC12132322'],
    [0,'John','ABXC12132322'],
    [0,'John','ABXC12132322'],
    [0,'John','ABXC12132322'],
    [0,'John','ABXC12132322'],
    [0,'John','ABXC12132322'],
    [0,'John','ABXC12132322'],
    [0,'John','ABXC12132322']
);
my %uniq_helper = ();
my @uniq_data = grep { !$uniq_helper{"@$_"}++ } @array;
print Dumper(\%uniq_helper) . "\n";
print Dumper(\@uniq_data) . "\n";

For your case it will probably look like this:

my %datalist;
while (my $row = <$fh>) {
    unless ($. == 0) {
        my @data;
        {
            local $/; # enable slurp
            @data = <$fh> =~ /\s*\d*\/\s*\d*\|\s*(.*?)\|.*?(?:.*?\|){4}\s*(\S*)\|(\S*).*\|/g; # $1 = article number, $2 = quantity, $3 = unit
        }
        $datalist{"@data"} = \@data;
        push(@filenames, $file);
        last;
    }
    $numr++;
}
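One caveat with the snippet above: keying %datalist by "@data" alone loses track of which file each data set came from. A sketch that keeps the association (duplicate_sets and its input layout are assumptions for illustration, not the answer's code):

```perl
use strict;
use warnings;

# Map each flattened data "fingerprint" to the files that produced it.
# Takes a hashref of filename => arrayref of extracted fields and returns
# only the groups of files whose extracted data is identical.
sub duplicate_sets {
    my ($data_for) = @_;
    my %files_for;
    for my $file (sort keys %$data_for) {
        push @{ $files_for{ join "\0", @{ $data_for->{$file} } } }, $file;
    }
    return grep { @$_ > 1 } values %files_for;
}

for my $set (duplicate_sets({ 'a.txt' => ['0815', '5', 'pcs'],
                              'b.txt' => ['0815', '5', 'pcs'],
                              'c.txt' => ['4711', '2', 'kg'] })) {
    print "Duplicates: @$set\n";   # prints: Duplicates: a.txt b.txt
}
```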

When you create @datalist, create a key for it and check for that key before you do the push, something like:

my %checkHash;
my $key = arrayKey(\@datalist);
if (!$checkHash{$key}) {
    push(@arrayofarrays, [@datalist]);
    push(@filenames, $file);
    $checkHash{$key} = 1;
    last;
}

sub arrayKey($) {
    my $arrayRef = shift;
    my $output = '';
    for (@$arrayRef) {
        if (ref($_) eq 'ARRAY') {
            $output .= "[";
            $output .= arrayKey($_);
            $output .= "]";
        }
        else {
            $output .= "$_,";
        }
    }
    return $output;
}
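For reference, the keys this helper produces look like this (a prototype-free copy of arrayKey is included so the snippet runs on its own):

```perl
use strict;
use warnings;

# Same logic as the answer's helper, without the prototype.
sub arrayKey {
    my $arrayRef = shift;
    my $output = '';
    for (@$arrayRef) {
        if (ref($_) eq 'ARRAY') {
            $output .= '[' . arrayKey($_) . ']';   # recurse into nested arrays
        }
        else {
            $output .= "$_,";
        }
    }
    return $output;
}

print arrayKey([1, 'John', 'ABXC12132322']), "\n";   # 1,John,ABXC12132322,
print arrayKey([1, [2, 3], 4]), "\n";                # 1,[2,3,]4,
```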

1 Comment

If the only parameter to arrayKey has to be an array reference, then surely \@ would be a better prototype than $. You could then call the subroutine as arrayKey(@datalist) and everything would still work. Mind you, like most people here, I think that prototypes are far more trouble than they're worth and I would never use them in examples aimed at beginners.
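A minimal sketch of what the commenter means by a \@ prototype (join_key is a made-up example sub, not the answer's code):

```perl
use strict;
use warnings;

# With a \@ prototype, Perl passes a reference to the named array
# automatically, so the caller writes join_key(@datalist) rather than
# join_key(\@datalist).
sub join_key (\@) {
    my $arrayRef = shift;          # receives \@datalist
    return join ',', @$arrayRef;
}

my @datalist = ('0815', '5', 'pcs');
print join_key(@datalist), "\n";   # 0815,5,pcs
```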
