2

Its a kind of bioinformatics concept but programmatic problem. I've tried a lot and at last I came here. I've reads like following.

ATGGAAG
TGGAAGT
GGAAGTC
GAAGTCG
AAGTCGC
AGTCGCG
GTCGCGG
TCGCGGA
CGCGGAA
GCGGAAT
CGGAATC

Now what I want to do is, in a simplistic way,

take the last 6 residues of first read -> check if any other read is starting with those 6 residues, if yes add the last residue of that read to the first read -> again same with the 2nd read and so on.

Here is the code what I've tried so far.

#!/usr/bin/perl -w

use strict;
use warnings;

my $in = $ARGV[0];
open(IN, $in);
my @short_reads = <IN>;
my $first_read = $short_reads[0];
chomp $first_read;
my @all_end_res;


for(my $i=0; $i<=$#short_reads; $i++){
        chomp $short_reads[$i];
        my $end_of_kmers = substr($short_reads[$i], -6);
        if($short_reads[$i+1] =~ /^$end_of_kmers/){
                my $end_res = substr($short_reads[$i], -1);
                push(@all_end_res, $end_res);
        }

}

my $end_res2 = join('', @all_end_res);
print $first_read.$end_res2,"\n\n";

At the end I should get an output like ATGGAAGTCGCGGAATC but I'm getting ATGGAAGGTCGCGGAAT. The error must be in if, any help is greatly appreciated.

5
  • use warnings; should be complaining about something. Have you taken a closer look at the warning running this program yields? Commented Jul 24, 2015 at 6:29
  • (If you're not seeing anything: when I run it, at least, I get: Use of uninitialized value in pattern match (m//) at bio.pl line 16, <IN> line 11.) Commented Jul 24, 2015 at 6:30
  • Yep, I am also getting the same. Is this the wrong way to use a variable in regex? Commented Jul 24, 2015 at 6:33
  • It's because during the last iteration of the script $short_reads[$i+1] doesn't exist (and is undef) Commented Jul 24, 2015 at 6:36
  • $short_reads[$i+1] is uninitialized if $i is already $#short_reads (in your final loop iteration): you're reading past the end of the array. Commented Jul 24, 2015 at 6:36

2 Answers 2

2

There are three huge problems in IT.

  1. Naming of things.
  2. Off by one error.

And you just hit the second one. The problem is in a way you think about this task. You think in way I have this one string and if next one overlap I will add this one character to result. But correct way to think in this case I have this one string and if it overlaps with previous string or what I read so far, I will add one character or characters which are next.

#!/usr/bin/env perl
use strict;
use warnings;

use constant LENGTH => 6;

my $read = <>;
chomp $read;

while (<>) {
    chomp;
    last unless length > LENGTH;
    if ( substr( $read, -LENGTH() ) eq substr( $_, 0, LENGTH ) ) {
        $read .= substr( $_, LENGTH );
    }
    else {last}
}

print $read, "\n";

I didn't get this ARGV[0] thing. It is useless and inflexible.

$ chmod +x code.pl
$ cat data
ATGGAAG
TGGAAGT
GGAAGTC
GAAGTCG
AAGTCGC
AGTCGCG
GTCGCGG
TCGCGGA
CGCGGAA
GCGGAAT
CGGAATC
$ ./code.pl data 
ATGGAAGTCGCGGAATC

But you have not defined what should happen if data doesn't overlap. Should there be some recovery or error? You can be also more strict

last unless length == LENGTH + 1;

Edit:

If you like working with an array you should try avoid using for(;;). It is prone to errors. (BTW for (my $i = 0; $i < @a; $i++) is more idiomatic.)

my @short_reads = <>;
chomp @short_reads;

my @all_end_res;
for my $i (1 .. $#short_reads) {
    my $prev_read = $short_reads[$i-1];
    my $curr_read = $short_reads[$i+1];
    my $end_of_kmers = substr($prev_read, -6);
    if ( $curr_read =~ /^\Q$end_of_kmers(.)/ ) {
        push @all_end_res, $1;
    }
}

print $short_reads[0], join('', @all_end_res), "\n";

The performance and memory difference is negligible up to thousands of lines. Now you can ask why to accumulate characters into an array instead of accumulate it to string.

my @short_reads = <>;
chomp @short_reads;

my $read = $short_reads[0];
for my $i (1 .. $#short_reads) {
    my $prev_read = $short_reads[$i-1];
    my $curr_read = $short_reads[$i+1];
    my $end_of_kmers = substr($prev_read, -6);
    if ( $curr_read =~ /^\Q$end_of_kmers(.)/ ) {
        $read .= $1;
    }
}

print "$read\n";

Now the question is why to use $prev_read when you have $end_of_kmers inside of $read.

my @short_reads = <>;
chomp @short_reads;

my $read = $short_reads[0];
for my $i (1 .. $#short_reads) {
    my $curr_read = $short_reads[$i+1];
    my $end_of_kmers = substr($read, -6);
    if ( $curr_read =~ /^\Q$end_of_kmers(.)/ ) {
        $read .= $1;
    }
}

print "$read\n";

Now you can ask why I need indexes at all. You just should remove the first line to work with the rest of array.

my @short_reads = <>;
chomp @short_reads;

my $read = shift @short_reads;
for my $curr_read (@short_reads) {
    my $end_of_kmers = substr($read, -6);
    if ( $curr_read =~ /^\Q$end_of_kmers(.)/ ) {
        $read .= $1;
    }
}

print "$read\n";

And with few more steps and tweaks you will end up with the code what I posted initially. I don't need an array at all because I look only to the current line and accumulator. The difference is in a way how you think about the problem. If you think in terms of arrays and indexes and looping or in terms of data flow, data processing and state/accumulator. With more experience, you don't have to do all those steps and make the final solution just due different approach to problem solving.

Edit2:

It is almost ten times faster using substr and eq then using regular expressions.

$ time ./code.pl data.out > data.test  

real    0m0.480s
user    0m0.468s
sys     0m0.008s
$ time ./code2.pl data.out > data2.test

real    0m4.520s
user    0m4.516s
sys     0m0.000s
$ cmp data.test data2.test && echo OK
OK
$ wc -c data.out data.test
6717368 data.out
839678 data.test
Sign up to request clarification or add additional context in comments.

8 Comments

Thank you. No need of doing anything if data doesn't overlap.
Out of interest, what's the 3rd problem in IT (you only give 2)?
@fugu: It is part of the joke. You tell kind of error and make one just on the spot. There is a version you start There are two huge problems and then tell three of them. Insert cache invalidation as second one and put off by one as third one. It is sad to have to explain jokes.
@Hynek-Pichi-Vychodil - gotcha! I'm not from an IT background so don't feel bad about explaining the joke (and I won't feel stupid for not understanding it). I wanted to say that you'd committed both errors in your statement, but thought that it would come across a bit snide
@Hynek-Pichi-Vychodil If I also want to print the data that is not overlapped, can I use the else statement to send the data into an array?
|
0

with minor modification:

use warnings;
use strict;


open my $in, '<', $ARGV[0] or die $!;
chomp(my @short_reads = <$in>);

my $first_read = $short_reads[0];

my @all_end_res;

for(my $i=0; $i<=$#short_reads; $i++){
        chomp $short_reads[$i];
        my $end_of_kmers = substr($short_reads[$i], -6);
        my ($next_read) = $short_reads[$i+1];
        if( (defined $next_read) and ($next_read =~ /^\Q$end_of_kmers/)){
                 my $end_res = substr($next_read, -1);
                 push(@all_end_res, $end_res);
         }

}

my $end_res2 = join('', @all_end_res);
print $first_read.$end_res2,"\n";

ATGGAAGTCGCGGAATC

6 Comments

it is only printing upto ATGGAAGTCGCGGAAT I don't find C at the end.
Try copying and pasting the whole program. Does it still not produce the desired output?
Name "main::DATA" used only once: possible typo at stack2.pl line 4. readline() on unopened filehandle DATA at stack2.pl line 4. Use of uninitialized value $first_read in concatenation (.) or string at stack2.pl line 22. This is what I'm getting.
@nEU - I don't know why it would throw that error - reading in from <DATA> always works for me. I've changed it back to how you originally had it, and it still produces the desired output
still I don't get the desired output, 'C' is missing at the end of result
|

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.