
I have the following script where I'm parsing two CSV files to find matches. The files have 10,000 lines each, but the processing is taking a very long time. Is this normal?

My script:

#!/bin/bash 

IFS=$'\n'

CSV_FILE1=$1;
CSV_FILE2=$2;

sort -t';' $CSV_FILE1 >> Sorted_CSV1
sort -t';' $CSV_FILE2 >> Sorted_CSV2

echo "PATH1 ; NAME1 ; SIZE1 ; CKSUM1 ; PATH2 ; NAME2 ; SIZE2 ; CKSUM2"  >> 'mapping.csv'


while read lineCSV1 #Parse 1st CSV file
do

       PATH1=`echo $lineCSV1 | awk '{print $1}'`
       NAME1=`echo $lineCSV1 | awk '{print $3}'`
       SIZE1=`echo $lineCSV1 | awk '{print $7}'`
       CKSUM1=`echo $lineCSV1 | awk '{print $9}'`      

    while read lineCSV2   #Parse 2nd CSV file
    do
       PATH2=`echo $lineCSV2 | awk '{print $1}'`
       NAME2=`echo $lineCSV2 | awk '{print $3}'`
       SIZE2=`echo $lineCSV2 | awk '{print $7}'`
       CKSUM2=`echo $lineCSV2 | awk '{print $9}'`

       # Test if NAME1 matches NAME2

        if [[ $NAME1 == $NAME2 ]]; then

            #Test checksum OF THE MATCHING NAME

                if [[ $CKSUM1 != $CKSUM2 ]]; then                   

            #MAPPING OF THE MATCHING LINES  
                echo $PATH1 ';' $NAME1 ';' $SIZE1 ';' $CKSUM1 ';' $PATH2 ';' $NAME2 ';' $SIZE2 ';' $CKSUM2 >> 'mapping.csv'
                fi
        break # When it's a match, break the inner while loop and go to the next row of the 1st CSV file
        fi       
    done < Sorted_CSV2 #Done CSV2


done < Sorted_CSV1 #Done CSV1
4 Comments
  • Yes, you're calling awk several times for each line of your file and reading the entirety of Sorted_CSV2 for every line of Sorted_CSV1. Executing processes is slow. Instead of using two nested while read loops in bash and then calling external tools, you should do the whole thing in a more powerful language. If your CSV file is simple enough, then awk would be a good choice. Commented Mar 26, 2015 at 12:24
  • Edit your question to show us a small sample of your data files and the desired output, then we'll be able to make some suggestions. Commented Mar 26, 2015 at 12:39
  • Note that as well as invoking awk many times, you're also reading the Sorted_CSV2 file completely for each line in Sorted_CSV1; that's a vastly expensive algorithm (so yes, taking ages to process is normal if you do it like this). You need to use a merge-like technique where you read a line from CSV1 and a line from CSV2 and decide what to do: (1) if the values in CSV1 and CSV2 match, print; else (2) if the value from CSV1 sorts before the one from CSV2, read a new line from CSV1; else (3) (the value from CSV2 sorts before the one from CSV1, so) read a new line from CSV2. Use awk for the whole job, with just a single run (see the sketch after these comments). Commented Mar 26, 2015 at 14:06
  • Just to put some numbers on the inefficiency: you will run awk 40,000 times for the values like PATH1 (which is a lot of times); you will run awk 400,000,000 times for the values like PATH2. Assume you can launch a thousand copies of awk each second; that's going to take four and a half days to run. Commented Mar 26, 2015 at 14:11
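For illustration only, here is a single-pass awk sketch along the lines suggested in these comments. It uses an associative array rather than a line-by-line merge, and it prints pairs that share a name but have different checksums, mirroring the original script's logic. It assumes ';'-separated records with the name in field 2 and the checksum in field 4; the field positions are guesses, since no sample data was posted:

# Sketch only: field positions (name = $2, checksum = $4) are assumptions.
awk -F';' '
    NR == FNR { line[$2] = $0; cksum[$2] = $4; next }              # 1st file: remember each record by name
    ($2 in cksum) && cksum[$2] != $4 { print line[$2] " ; " $0 }   # 2nd file: same name, different checksum
' "$CSV_FILE1" "$CSV_FILE2" > mapping.csv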

3 Answers


This is quadratic: O(N²). Also, see Tom Fenech's comment: you are calling awk several times for each line, inside a loop inside another loop. Instead of using awk to extract the fields from every line, try setting the IFS shell variable to ";" and reading the fields directly with read:

IFS=";"
while read FIELD11 FIELD12 FIELD13; do
    while read FIELD21 FIELD22 FIELD23; do
        ...
    done <Sorted_CSV2
done <Sorted_CSV1

This would still be O(N²) and very inefficient, though. It seems you are matching the two files on a common field. That task is easier and faster to accomplish with the join command-line utility, which would reduce the order from O(N²) to O(N).
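For example, assuming ';'-separated files with the name in field 2 and the checksum in field 4 (a guess, since no sample data was posted), something along these lines could replace the nested loops entirely:

# Sketch only: join the two files on the (assumed) name field,
# then keep the pairs whose checksums differ.
join -t';' -1 2 -2 2 \
    <(sort -t';' -k2,2 "$CSV_FILE1") \
    <(sort -t';' -k2,2 "$CSV_FILE2") \
| awk -F';' '$4 != $7' > mapping.csv

join requires both inputs to be sorted on the join field, hence the two sort calls.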


4 Comments

You're still reading the second file each time a line of the first file is read. In Bash, a while read loop is notoriously slow. Besides, you're not addressing the other problems in the OP: why set IFS globally like that? The OP opens and closes the file mapping.csv for each match, there's a lack of quoting, overall design problems, etc.
That is a code excerpt, not a working answer. Besides, I wouldn't do that task with a script if I could use join instead.
Up-vote for the "O(N²) quadratic behaviour" diagnosis. However, as already pointed out, the replacement code is also quadratic; it reads the CSV2 file once for each line in CSV1. The code needs to become O(N) linear, like a merge process.
For merging the files in O(N), an external tool must be used, like join. The code snippet in my answer just illustrates how to remove the O(N²) calls to awk, but it is still O(N²). An external tool like join is the best approach here for matching lines on the same value. But without file samples, the right command is not obvious.

Whenever you say "Does this file/data list/table have something that matches this file/data list/table?", you should think of associative arrays (sometimes called hashes).

An associative array is keyed by a particular value and each key is associated with a value. The nice thing is that finding a key is extremely fast.

In your loop within a loop, you have 10,000 lines in each file. Your outer loop executes 10,000 times, and your inner loop may execute 10,000 times for each and every line in your first file. That's 10,000 × 10,000 passes, potentially 100 million trips through that inner loop. Think you can see why your program might be a little slow?

In this day and age, having a 10,000 member associative array isn't that bad. (Imagine doing this back in 1980 on an MS-DOS system with 256K. It just wouldn't work.) So, let's go through the first file, create a 10,000 member associative array, and then go through the second file looking for matching lines.

Bash 4.x has associative arrays, but I only have Bash 3.2 on my system, so I can't really give you an answer in Bash.

Besides, sometimes Bash isn't the answer to a particular issue. Bash can be a bit slow and the syntax can be error prone. Awk might be faster, and it does have associative arrays, but this is really a job for a higher level scripting language like Python or Perl.

Since I can't do a Bash answer, here's a Perl answer. Maybe this will help, or maybe it will inspire someone who has Bash 4.x to give an answer in Bash.

Basically, I open the first file and create an associative array keyed by the checksum. If this is a SHA-1 checksum, it should be unique for all files (unless they're exact duplicates). If you don't have a SHA-1 checksum, you'll need to massage the structure a wee bit, but it's pretty much the same idea.

Once I have the associative array built, I then open file #2 and simply see if the checksum already exists in the array. If it does, I know I have a matching line, and I print out the two matches.

I have to loop 10,000 times over the first file, and 10,000 times over the second. That's only 20,000 iterations instead of 100 million, thousands of times less looping, which means the program will run thousands of times faster. So, if it takes two full days for your program to run with a double loop, an associative array solution will finish in well under a minute.

#! /usr/bin/env perl
#
use strict;
use warnings;
use autodie;

use feature qw(say);

use constant {
    FILE1       => "file1.txt",
    FILE2       => "file2.txt",
    MATCHING    => "csv_matches.txt",
};

#
# Open the first file and create the associative array
#
my %file_data;

open my $fh1, "<", FILE1;

while ( my $line = <$fh1> ) {
    chomp $line;
    my ( $path, $blah, $name, $bather, $yadda, $tl_dr, $size, $etc, $check_sum ) = split /\s+/, $line;
    #
    # The main key is "check_sum" which **should** be unique, especially if it's a sha1
    #
    $file_data{$check_sum}->{PATH} = $path;
    $file_data{$check_sum}->{NAME} = $name;
    $file_data{$check_sum}->{SIZE} = $size;
}
close $fh1;

#
# Now, we have the associative array keyed by the data we want to match, read file 2
#

open my $fh2, "<", FILE2;
open my $csv_fh, ">", MATCHING;
while ( my $line = <$fh2> ) {
    chomp $line;
    my ( $path, $blah, $name, $bather, $yadda, $tl_dr, $size, $etc, $check_sum ) = split /\s+/, $line;
    #
    # If there is a matching checksum in file1, we know we have a matching entry
    #
    if ( exists $file_data{$check_sum} ) {
    printf {$csv_fh} "%s;%s;%s;%s;%s;%s\n",
           $file_data{$check_sum}->{PATH}, $file_data{$check_sum}->{NAME}, $file_data{$check_sum}->{SIZE},
           $path, $name, $size;
   }
}
close $fh2;
close $csv_fh;

BUGS

(A good manpage always lists issues!)

  • This assumes one match per file. If you have multiple duplicates in file1 or file2, you will only pick up the last one.
  • This assumes a SHA-256 or equivalent checksum. With such a checksum, it is extremely unlikely that two files will have the same checksum unless they are identical. A 16-bit checksum from the historic sum command may have collisions.
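For completeness, here is a rough Bash 4 sketch of the same hash-lookup idea (untested, and the ';'-separated path;name;size;cksum layout is just as much an assumption as the field order in the Perl version):

#!/usr/bin/env bash
# Requires Bash 4+ for associative arrays.
declare -A seen

# Pass 1: remember each record from file 1, keyed by its checksum.
while IFS=';' read -r path name size cksum; do
    seen[$cksum]="$path;$name;$size"
done < file1.txt

# Pass 2: for every record in file 2, look the checksum up in the array.
while IFS=';' read -r path name size cksum; do
    if [[ -n "${seen[$cksum]+x}" ]]; then
        printf '%s;%s;%s;%s\n' "${seen[$cksum]}" "$path" "$name" "$size"
    fi
done < file2.txt > csv_matches.txt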



Although a proper database engine would make a much better tool for this, it is still perfectly possible to do it with awk.

The trick is to sort your data, so that records with the same name are grouped together. Then a single pass from top to bottom is enough to find the matches. This can be done in linear time.

In detail:

Insert two columns in both CSV files

Make sure every line starts with the name. Also add a number (either 1 or 2) which denotes from which file the line originates. We will need this when we merge the two files together.

awk -F';' '{ print $2 ";1;" $0 }' csvfile1 > tmpfile1
awk -F';' '{ print $2 ";2;" $0 }' csvfile2 > tmpfile2

Concatenate the files, then sort the lines

sort tmpfile1 tmpfile2 > tmpfile3

Scan the result, report the mismatches

awk -F';' -f scan.awk tmpfile3

Where scan.awk contains:

BEGIN {
    origin = 3;
}

$1 == name && $2 > origin && $6 != checksum {
    print record;
}

{
    name = $1;
    origin = $2;
    checksum = $6;
    sub(/^[^;]*;.;/, "");
    record = $0;
}

Putting it all together

Crammed together into a Bash one-liner, without explicit temporary files:

(awk -F';' '{print $2";1;"$0}' csvfile1 ; awk -F';' '{print $2";2;"$0}' csvfile2) | sort | awk -F';' 'BEGIN{origin=3}$1==name&&$2>origin&&$6!=checksum{print record}{name=$1;origin=$2;checksum=$6;sub(/^[^;]*;.;/,"");record=$0;}'

Notes:

  • If the same name appears more than once in csvfile1, then all but the last one are ignored.
  • If the same name appears more than once in csvfile2, then all but the first one are ignored.

