1

The following is the script for finding consecutive substrings in strings.

use strict;
use warnings;

my $file="Sample.txt";
open(DAT, $file) || die("Could not open file!");

#worry about these later
#my $regexp1 = "motif1";
#my $regexp2 = "motif2";
#my $regexp3 = "motif3";
#my $regexp4 = "motif4";

my $sequence;

while (my $line = <DAT>) {
    if ($line=~ /(HDWFLSFKD)/g){
        {
        print "its found index location: ",
        pos($line), "-",  pos($line)+length($1), "\n";        
        }
        if ($line=~ /(HD)/g){
                print "motif found and its locations is: \n";
                pos($line), "-", pos($line)+length($1), "\n\n";
                }
                if ($line=~ /(K)/g){
                        print "motif found and its location is: \n";
                        pos($line), "-",pos($line)+length($1), "\n\n";
                        }
                        if ($line=~ /(DD)/g){
                                print "motif found and its location is: \n";
                                pos($line), "-", pos($line)+length($1), "\n\n";
                                }
}else {
        $sequence .= $line;
        print "came in else\n";
    }
}

It matches substring1 with string and prints out position where substring1 matched. The problem lies in finding the rest of the substrings. For substrings2 it starts again from the beginning of the string (instead of starting from the position where substring1 was found). The problem is that every time it calculates position it starts from the beginning of string instead of starting from the position of the previously found substring. Since substrings are consecutive substring1, substring2, substring3, substring4, their positions have to occur after the previous respectively.

6
  • 1
    Your code seems to have some problems with opening and closing brackets. Try to fix your indentation in order to sort this out. Commented May 7, 2009 at 13:08
  • 1
    Most of the Perl questions these days revolve around regular expressions in different contexts. Can we find a way to answer one such question decently and close all the others as dupes? Commented May 7, 2009 at 13:17
  • Reading perldoc perlretut once or twice and consulting perldoc perlreref as the need arises would yield the answer to most of these questions immediately. Commented May 7, 2009 at 13:31
  • Yes, but "RTFM" is not a reason for closing questions. Commented May 7, 2009 at 13:40
  • agreed with @Manni. Someone with high reputation should make a good detailed wiki-page about regex and perl. Commented May 7, 2009 at 17:23

5 Answers 5

2

Try this perl program

use strict;
use warnings;
use feature qw'say';

my $file="Sample.txt";
open( my $dat, '<', $file) || die("Could not open file!");

my @regex = qw(
  HDWFLSFKD
  HD
  K
  DD
);

my $sequence;

while( my $line = <$dat> ){
  chomp $line;
  
  say 'Line: ', $.;
  
  # reset the position of variable $line
  # pos is an lvalue subroutine
  pos $line = 0;
  
  for my $regex ( @regex ){
    $regex = quotemeta $regex;
    
    if( scalar $line =~ / \G (.*?) ($regex) /xg ){
      say $regex, ' found at location (', $-[2], '-', $+[2], ')';
      if( $1 ){
        say "    but skipped: \"$1\" at location ($-[1]-$+[1])";
      }
    }else{
      say 'Unable to find ', $regex;
      
      # end loop
      last;
    }
  }
}
Sign up to request clarification or add additional context in comments.

3 Comments

that is awesome and much less lines of code! gah I'm starting to love perl!
It is also easy to add more regexes, if needed.
it would help me (and maybe to OP) if you added some comments explaining what is going on. it looks great but since atleast I'm a n00b to perl I dont get some of the lines..e.g. use feature qw'say';
1

I'm not a perl expert but you can use $- and $+ to track index location for last regex match found.
Below is code built on top of your code that explains this.

use strict;
use warnings;


my $file="sample.txt";
open(DAT, $file) || die("Could not open file!");

open (OUTPUTFILE, '>data.txt');

my $sequence;
my $someVar = 0;
my $sequenceNums = 1;

my $motif1 = "(HDWFLSFKD)";
my $motif2 = "(HD)";
my $motif3 = "(K)";
my $motif4 = "(DD)";

while (my $line = <DAT>) 
{
    $someVar = 0;
    print "\nSequence $sequenceNums: $line\n";
    print OUTPUTFILE "\nSequence $sequenceNums: $line\n";
        if ($line=~ /$motif1/g)
        {
                &printStuff($sequenceNums, "motif1", $motif1, "$-[0]-$+[0]");
                $someVar = 1;
        }


        if ($line=~ /$motif2/g and $someVar == 1)
        {
                &printStuff($sequenceNums, "motif2", $motif2, "$-[0]-$+[0]");
                $someVar = 2;
        }

        if ($line=~ /$motif3/g and $someVar == 2)
        {
                &printStuff($sequenceNums, "motif3", $motif4, "$-[0]-$+[0]");
                $someVar = 3;
        }

        if ($line=~ /$motif4/g and $someVar == 3)
        {
                &printStuff($sequenceNums, "motif4", $motif4, "$-[0]-$+[0]");
        }

        else 
        {
            $sequence .= $line;

            if ($someVar == 0)
            {
                &printWrongStuff($sequenceNums, "motif1", $motif1);
            }
            elsif ($someVar == 1)
            {
            &printWrongStuff($sequenceNums, "motif2", $motif2);
            }
            elsif ($someVar == 2)
            {
            &printWrongStuff($sequenceNums, "motif3", $motif3);
            }
            elsif ($someVar == 3)
            {
            &printWrongStuff($sequenceNums, "motif4", $motif4);
            }
        }
        $sequenceNums++;
}

sub printStuff
{
            print "Sequence: $_[0] $_[1]: $_[2] index location: $_[3] \n";
            print OUTPUTFILE "Sequence: $_[0]  $_[1]: $_[2] index location: $_[3]\n";
}

sub printWrongStuff
{
            print "Sequence: $_[0] $_[1]: $_[2] was not found\n";
            print OUTPUTFILE "Sequence: $_[0] $_[1]: $_[2] was not found\n";    

}

close (OUTPUTFILE);
close (DAT);

Sample input:

MLTSHQKKFHDWFLSFKDSNNYNHDSKQNHSIKDDIFNRFNHYIYNDLGIRTIA MLTSHQKKFSNNYNSKQNHSIKDIFNRFNHYIYNDLGIRTIA MLTSHQKKFSNNYNSKHDWFLSFKDQNHSIKDIFNRFNHYIYNDL

Comments

1

You really should read

You need the special variables @- and @+ if you need the positions. No need to try to compute them yourself.

#!/usr/bin/perl

use strict;
use warnings;

use List::MoreUtils qw( each_array );

my $source = 'AAAA   BBCCC   DD  E      FFFFF';
my $pattern = join '\s*', map { "($_+)" } qw( A B C D E F );



if ( $source =~ /$pattern/ ) {
    my $it = each_array @-, @+;

    $it->(); # discard overall match information;

    while ( my ($start, $end) = $it->() ) {
        printf "Start: %d - Length: %d\n", $start, $end - $start;
    }
}

Start: 0 - Length: 4
Start: 7 - Length: 2
Start: 9 - Length: 3
Start: 15 - Length: 2
Start: 19 - Length: 1
Start: 26 - Length: 5

Comments

0

The result of a construct like

$line=~ /(HD)/g

is a list. Use while to step through the hits.

1 Comment

The OP needs the start positions as well as the lengths.
0

To match where the last match left off, use \G. perldoc perlre says (but consult your own installation's version's manual first):

The "\G" assertion can be used to chain global matches (using "m//g"), as described in "Regexp Quote-Like Operators" in perlop. It is also useful when writing "lex"-like scanners, when you have several patterns that you want to match against consequent substrings of your string, see the previous reference. The actual location where "\G" will match can also be influenced by using "pos()" as an lvalue: see "pos" in perlfunc. Note that the rule for zero-length matches is modified somewhat, in that contents to the left of "\G" is not counted when determining the length of the match. Thus the following will not match forever:

$str = 'ABC';
pos($str) = 1;
while (/.\G/g) {
    print $&;
}

2 Comments

Why the gratuitous use of $& ? See perldoc perlreref: "The use of $`, $& or $' will slow down all regex use within your program. Consult perlvar for "@-" to see equivalent expressions that won't cause slow down."
Probably to make an illustrative point. Take it up with the perlre author.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.