I have multiple tab-delimited FASTQ files. I want to match the second line of each read across the files and, when the sequences match, add up the values next to them. For example:
file1.fq
>1
ATGCCGTT file1:1
+
HHHHKKKK
file2.fq
>2
ATGCCGTT file2:3
+
JJKHHTTT
>3
ATTCCAAC file2:1
+
=#GJLMNB
The output I want is like this (count is the sum of the numbers after the colons, e.g. 1 + 3 = 4 for ATGCCGTT):
output.txt
ATGCCGTT file1:1 file2:3 count:4
ATTCCAAC file2:1 count:1
The code I have written is:
#!/usr/bin/env perl
use strict;
use warnings;
no warnings qw( numeric );

my %seen;

$/ = "";    # paragraph mode: read one blank-line-separated record at a time

while (<>) {
    chomp;
    # each record looks like: header \n sequence<TAB>file:count \n + \n quality
    my ($key, $value) = split /\t/, $_;
    my @lines = split /\n/, $key;
    my $key1  = $lines[1];               # the sequence line is the hash key
    $seen{$key1} //= [ $key ];
    push @{ $seen{$key1} }, $value;
}

foreach my $key1 ( sort keys %seen ) {
    my $tot        = 0;
    my $file_count = @ARGV;
    for my $val ( @{ $seen{$key1} } ) {
        $tot += ( split /:/, $val )[0];
    }
    if ( @{ $seen{$key1} } >= $file_count ) {
        print join( "\t", @{ $seen{$key1} } );
        print "\tcount:" . $tot . "\n\n";
    }
}
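For context, I run it like this (the script name here is just a placeholder):

perl sum_counts.pl file1.fq file2.fq > output.txt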
This code works well for small files, but when I compare large files it uses up all the memory and the script runs without ever producing results. I want to modify the script so that it does not consume so much memory. I don't want to use any modules. I think that loading only one file into memory at a time would save memory, but I have been unable to do it. Please help me modify my script.
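To make the idea concrete, here is a rough sketch of the direction I am aiming for: drop the headers and quality lines as soon as each record is parsed, and keep only a running total plus the accumulated file:count tags per sequence (%total and %tags are placeholder names, and this still keeps one hash entry per distinct sequence, so I am not sure it is enough for the largest files):

#!/usr/bin/env perl
use strict;
use warnings;

my %total;    # sequence => running sum of the counts
my %tags;     # sequence => accumulated "file:count" tags

$/ = "";      # paragraph mode, as in my script above

while (<>) {
    chomp;
    # assumed record layout: header \n sequence<TAB>file:count \n + \n quality
    my ($left, $right) = split /\t/, $_, 2;
    next unless defined $right;
    my $seq   = ( split /\n/, $left )[1];     # the sequence line
    my ($tag) = $right =~ /^(\S+:\d+)/;       # e.g. "file1:1"
    next unless defined $seq and defined $tag;

    my ($count) = $tag =~ /:(\d+)$/;
    $total{$seq} += $count;
    $tags{$seq}   = defined $tags{$seq} ? "$tags{$seq}\t$tag" : $tag;
}

for my $seq ( sort keys %total ) {
    print join( "\t", $seq, $tags{$seq}, "count:$total{$seq}" ), "\n";
}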