There are three huge problems in IT.
- Naming of things.
- Off by one error.
And you just hit the second one. The problem is in a way you think about this task. You think in way I have this one string and if next one overlap I will add this one character to result. But correct way to think in this case I have this one string and if it overlaps with previous string or what I read so far, I will add one character or characters which are next.
#!/usr/bin/env perl
use strict;
use warnings;
use constant LENGTH => 6;
my $read = <>;
chomp $read;
while (<>) {
chomp;
last unless length > LENGTH;
if ( substr( $read, -LENGTH() ) eq substr( $_, 0, LENGTH ) ) {
$read .= substr( $_, LENGTH );
}
else {last}
}
print $read, "\n";
I didn't get this ARGV[0] thing. It is useless and inflexible.
$ chmod +x code.pl
$ cat data
ATGGAAG
TGGAAGT
GGAAGTC
GAAGTCG
AAGTCGC
AGTCGCG
GTCGCGG
TCGCGGA
CGCGGAA
GCGGAAT
CGGAATC
$ ./code.pl data
ATGGAAGTCGCGGAATC
But you have not defined what should happen if data doesn't overlap. Should there be some recovery or error? You can be also more strict
last unless length == LENGTH + 1;
Edit:
If you like working with an array you should try avoid using for(;;). It is prone to errors. (BTW for (my $i = 0; $i < @a; $i++) is more idiomatic.)
my @short_reads = <>;
chomp @short_reads;
my @all_end_res;
for my $i (1 .. $#short_reads) {
my $prev_read = $short_reads[$i-1];
my $curr_read = $short_reads[$i+1];
my $end_of_kmers = substr($prev_read, -6);
if ( $curr_read =~ /^\Q$end_of_kmers(.)/ ) {
push @all_end_res, $1;
}
}
print $short_reads[0], join('', @all_end_res), "\n";
The performance and memory difference is negligible up to thousands of lines. Now you can ask why to accumulate characters into an array instead of accumulate it to string.
my @short_reads = <>;
chomp @short_reads;
my $read = $short_reads[0];
for my $i (1 .. $#short_reads) {
my $prev_read = $short_reads[$i-1];
my $curr_read = $short_reads[$i+1];
my $end_of_kmers = substr($prev_read, -6);
if ( $curr_read =~ /^\Q$end_of_kmers(.)/ ) {
$read .= $1;
}
}
print "$read\n";
Now the question is why to use $prev_read when you have $end_of_kmers inside of $read.
my @short_reads = <>;
chomp @short_reads;
my $read = $short_reads[0];
for my $i (1 .. $#short_reads) {
my $curr_read = $short_reads[$i+1];
my $end_of_kmers = substr($read, -6);
if ( $curr_read =~ /^\Q$end_of_kmers(.)/ ) {
$read .= $1;
}
}
print "$read\n";
Now you can ask why I need indexes at all. You just should remove the first line to work with the rest of array.
my @short_reads = <>;
chomp @short_reads;
my $read = shift @short_reads;
for my $curr_read (@short_reads) {
my $end_of_kmers = substr($read, -6);
if ( $curr_read =~ /^\Q$end_of_kmers(.)/ ) {
$read .= $1;
}
}
print "$read\n";
And with few more steps and tweaks you will end up with the code what I posted initially. I don't need an array at all because I look only to the current line and accumulator. The difference is in a way how you think about the problem. If you think in terms of arrays and indexes and looping or in terms of data flow, data processing and state/accumulator. With more experience, you don't have to do all those steps and make the final solution just due different approach to problem solving.
Edit2:
It is almost ten times faster using substr and eq then using regular expressions.
$ time ./code.pl data.out > data.test
real 0m0.480s
user 0m0.468s
sys 0m0.008s
$ time ./code2.pl data.out > data2.test
real 0m4.520s
user 0m4.516s
sys 0m0.000s
$ cmp data.test data2.test && echo OK
OK
$ wc -c data.out data.test
6717368 data.out
839678 data.test
use warnings;should be complaining about something. Have you taken a closer look at the warning running this program yields?Use of uninitialized value in pattern match (m//) at bio.pl line 16, <IN> line 11.)$short_reads[$i+1]doesn't exist (and isundef)$short_reads[$i+1]is uninitialized if$iis already$#short_reads(in your final loop iteration): you're reading past the end of the array.