Difficulty in finding error in perl script

Question

Its a kind of bioinformatics concept but programmatic problem. I've tried a lot and at last I came here. I've reads like following.

ATGGAAG
TGGAAGT
GGAAGTC
GAAGTCG
AAGTCGC
AGTCGCG
GTCGCGG
TCGCGGA
CGCGGAA
GCGGAAT
CGGAATC

Now what I want to do is, in a simplistic way,

take the last 6 residues of first read -> check if any other read is starting with those 6 residues, if yes add the last residue of that read to the first read -> again same with the 2nd read and so on.

Here is the code what I've tried so far.

#!/usr/bin/perl -w

use strict;
use warnings;

my $in = $ARGV[0];
open(IN, $in);
my @short_reads = <IN>;
my $first_read = $short_reads[0];
chomp $first_read;
my @all_end_res;


for(my $i=0; $i<=$#short_reads; $i++){
        chomp $short_reads[$i];
        my $end_of_kmers = substr($short_reads[$i], -6);
        if($short_reads[$i+1] =~ /^$end_of_kmers/){
                my $end_res = substr($short_reads[$i], -1);
                push(@all_end_res, $end_res);
        }

}

my $end_res2 = join('', @all_end_res);
print $first_read.$end_res2,"\n\n";

At the end I should get an output like ATGGAAGTCGCGGAATC but I'm getting ATGGAAGGTCGCGGAAT. The error must be in if, any help is greatly appreciated.

use warnings; should be complaining about something. Have you taken a closer look at the warning running this program yields? — lynn
– lynn, Commented Jul 24, 2015 at 6:29
(If you're not seeing anything: when I run it, at least, I get: Use of uninitialized value in pattern match (m//) at bio.pl line 16, <IN> line 11.) — lynn
– lynn, Commented Jul 24, 2015 at 6:30
Yep, I am also getting the same. Is this the wrong way to use a variable in regex? — nEU
– nEU, Commented Jul 24, 2015 at 6:33
It's because during the last iteration of the script $short_reads[$i+1] doesn't exist (and is undef) — fugu
– fugu, Commented Jul 24, 2015 at 6:36
$short_reads[$i+1] is uninitialized if $i is already $#short_reads (in your final loop iteration): you're reading past the end of the array. — lynn
– lynn, Commented Jul 24, 2015 at 6:36

Hynek -Pichi- Vychodil · Accepted Answer · 2015-07-24 13:43:57Z

2

There are three huge problems in IT.

Naming of things.
Off by one error.

And you just hit the second one. The problem is in a way you think about this task. You think in way I have this one string and if next one overlap I will add this one character to result. But correct way to think in this case I have this one string and if it overlaps with previous string or what I read so far, I will add one character or characters which are next.

#!/usr/bin/env perl
use strict;
use warnings;

use constant LENGTH => 6;

my $read = <>;
chomp $read;

while (<>) {
    chomp;
    last unless length > LENGTH;
    if ( substr( $read, -LENGTH() ) eq substr( $_, 0, LENGTH ) ) {
        $read .= substr( $_, LENGTH );
    }
    else {last}
}

print $read, "\n";

I didn't get this ARGV[0] thing. It is useless and inflexible.

$ chmod +x code.pl
$ cat data
ATGGAAG
TGGAAGT
GGAAGTC
GAAGTCG
AAGTCGC
AGTCGCG
GTCGCGG
TCGCGGA
CGCGGAA
GCGGAAT
CGGAATC
$ ./code.pl data 
ATGGAAGTCGCGGAATC

But you have not defined what should happen if data doesn't overlap. Should there be some recovery or error? You can be also more strict

last unless length == LENGTH + 1;

Edit:

If you like working with an array you should try avoid using for(;;). It is prone to errors. (BTW for (my $i = 0; $i < @a; $i++) is more idiomatic.)

my @short_reads = <>;
chomp @short_reads;

my @all_end_res;
for my $i (1 .. $#short_reads) {
    my $prev_read = $short_reads[$i-1];
    my $curr_read = $short_reads[$i+1];
    my $end_of_kmers = substr($prev_read, -6);
    if ( $curr_read =~ /^\Q$end_of_kmers(.)/ ) {
        push @all_end_res, $1;
    }
}

print $short_reads[0], join('', @all_end_res), "\n";

The performance and memory difference is negligible up to thousands of lines. Now you can ask why to accumulate characters into an array instead of accumulate it to string.

my @short_reads = <>;
chomp @short_reads;

my $read = $short_reads[0];
for my $i (1 .. $#short_reads) {
    my $prev_read = $short_reads[$i-1];
    my $curr_read = $short_reads[$i+1];
    my $end_of_kmers = substr($prev_read, -6);
    if ( $curr_read =~ /^\Q$end_of_kmers(.)/ ) {
        $read .= $1;
    }
}

print "$read\n";

Now the question is why to use $prev_read when you have $end_of_kmers inside of $read.

my @short_reads = <>;
chomp @short_reads;

my $read = $short_reads[0];
for my $i (1 .. $#short_reads) {
    my $curr_read = $short_reads[$i+1];
    my $end_of_kmers = substr($read, -6);
    if ( $curr_read =~ /^\Q$end_of_kmers(.)/ ) {
        $read .= $1;
    }
}

print "$read\n";

Now you can ask why I need indexes at all. You just should remove the first line to work with the rest of array.

my @short_reads = <>;
chomp @short_reads;

my $read = shift @short_reads;
for my $curr_read (@short_reads) {
    my $end_of_kmers = substr($read, -6);
    if ( $curr_read =~ /^\Q$end_of_kmers(.)/ ) {
        $read .= $1;
    }
}

print "$read\n";

And with few more steps and tweaks you will end up with the code what I posted initially. I don't need an array at all because I look only to the current line and accumulator. The difference is in a way how you think about the problem. If you think in terms of arrays and indexes and looping or in terms of data flow, data processing and state/accumulator. With more experience, you don't have to do all those steps and make the final solution just due different approach to problem solving.

Edit2:

It is almost ten times faster using substr and eq then using regular expressions.

$ time ./code.pl data.out > data.test  

real    0m0.480s
user    0m0.468s
sys     0m0.008s
$ time ./code2.pl data.out > data2.test

real    0m4.520s
user    0m4.516s
sys     0m0.000s
$ cmp data.test data2.test && echo OK
OK
$ wc -c data.out data.test
6717368 data.out
839678 data.test

edited Jul 24, 2015 at 13:43

answered Jul 24, 2015 at 7:30

Hynek -Pichi- Vychodil

26.2k5 gold badges55 silver badges75 bronze badges

Sign up to request clarification or add additional context in comments.

8 Comments

nEU Over a year ago

Thank you. No need of doing anything if data doesn't overlap.

fugu Over a year ago

Out of interest, what's the 3rd problem in IT (you only give 2)?

Hynek -Pichi- Vychodil Over a year ago

@fugu: It is part of the joke. You tell kind of error and make one just on the spot. There is a version you start There are two huge problems and then tell three of them. Insert cache invalidation as second one and put off by one as third one. It is sad to have to explain jokes.

fugu Over a year ago

@Hynek-Pichi-Vychodil - gotcha! I'm not from an IT background so don't feel bad about explaining the joke (and I won't feel stupid for not understanding it). I wanted to say that you'd committed both errors in your statement, but thought that it would come across a bit snide

nEU Over a year ago

@Hynek-Pichi-Vychodil If I also want to print the data that is not overlapped, can I use the else statement to send the data into an array?

|

fugu · Accepted Answer · 2015-07-24 07:22:25Z

0

with minor modification:

use warnings;
use strict;


open my $in, '<', $ARGV[0] or die $!;
chomp(my @short_reads = <$in>);

my $first_read = $short_reads[0];

my @all_end_res;

for(my $i=0; $i<=$#short_reads; $i++){
        chomp $short_reads[$i];
        my $end_of_kmers = substr($short_reads[$i], -6);
        my ($next_read) = $short_reads[$i+1];
        if( (defined $next_read) and ($next_read =~ /^\Q$end_of_kmers/)){
                 my $end_res = substr($next_read, -1);
                 push(@all_end_res, $end_res);
         }

}

my $end_res2 = join('', @all_end_res);
print $first_read.$end_res2,"\n";

ATGGAAGTCGCGGAATC

edited Jul 24, 2015 at 7:22

answered Jul 24, 2015 at 6:46

fugu

6,6129 gold badges42 silver badges82 bronze badges

6 Comments

nEU Over a year ago

it is only printing upto ATGGAAGTCGCGGAAT I don't find C at the end.

fugu Over a year ago

Try copying and pasting the whole program. Does it still not produce the desired output?

nEU Over a year ago

Name "main::DATA" used only once: possible typo at stack2.pl line 4. readline() on unopened filehandle DATA at stack2.pl line 4. Use of uninitialized value $first_read in concatenation (.) or string at stack2.pl line 22.

This is what I'm getting.

fugu Over a year ago

@nEU - I don't know why it would throw that error - reading in from <DATA> always works for me. I've changed it back to how you originally had it, and it still produces the desired output

nEU Over a year ago

still I don't get the desired output, 'C' is missing at the end of result

|

Collectives™ on Stack Overflow

Difficulty in finding error in perl script

2 Answers 2

8 Comments

6 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

8 Comments

6 Comments

Your Answer

Sign up or log in

Post as a guest

Related