Find multiple substrings in strings and record location

Question

The following is the script for finding consecutive substrings in strings.

use strict;
use warnings;

my $file="Sample.txt";
open(DAT, $file) || die("Could not open file!");

#worry about these later
#my $regexp1 = "motif1";
#my $regexp2 = "motif2";
#my $regexp3 = "motif3";
#my $regexp4 = "motif4";

my $sequence;

while (my $line = <DAT>) {
    if ($line=~ /(HDWFLSFKD)/g){
        {
        print "its found index location: ",
        pos($line), "-",  pos($line)+length($1), "\n";        
        }
        if ($line=~ /(HD)/g){
                print "motif found and its locations is: \n";
                pos($line), "-", pos($line)+length($1), "\n\n";
                }
                if ($line=~ /(K)/g){
                        print "motif found and its location is: \n";
                        pos($line), "-",pos($line)+length($1), "\n\n";
                        }
                        if ($line=~ /(DD)/g){
                                print "motif found and its location is: \n";
                                pos($line), "-", pos($line)+length($1), "\n\n";
                                }
}else {
        $sequence .= $line;
        print "came in else\n";
    }
}

It matches substring1 with string and prints out position where substring1 matched. The problem lies in finding the rest of the substrings. For substrings2 it starts again from the beginning of the string (instead of starting from the position where substring1 was found). The problem is that every time it calculates position it starts from the beginning of string instead of starting from the position of the previously found substring. Since substrings are consecutive substring1, substring2, substring3, substring4, their positions have to occur after the previous respectively.

Your code seems to have some problems with opening and closing brackets. Try to fix your indentation in order to sort this out. — Svante
– Svante, Commented May 7, 2009 at 13:08
Most of the Perl questions these days revolve around regular expressions in different contexts. Can we find a way to answer one such question decently and close all the others as dupes? — innaM
– innaM, Commented May 7, 2009 at 13:17
Reading perldoc perlretut once or twice and consulting perldoc perlreref as the need arises would yield the answer to most of these questions immediately. — Sinan Ünür
– Sinan Ünür, Commented May 7, 2009 at 13:31
agreed with @Manni. Someone with high reputation should make a good detailed wiki-page about regex and perl. — Omnipresent
– Omnipresent, Commented May 7, 2009 at 17:23

Community · Accepted Answer · 2020-06-20 09:12:55Z

2

Try this perl program

use strict;
use warnings;
use feature qw'say';

my $file="Sample.txt";
open( my $dat, '<', $file) || die("Could not open file!");

my @regex = qw(
  HDWFLSFKD
  HD
  K
  DD
);

my $sequence;

while( my $line = <$dat> ){
  chomp $line;
  
  say 'Line: ', $.;
  
  # reset the position of variable $line
  # pos is an lvalue subroutine
  pos $line = 0;
  
  for my $regex ( @regex ){
    $regex = quotemeta $regex;
    
    if( scalar $line =~ / \G (.*?) ($regex) /xg ){
      say $regex, ' found at location (', $-[2], '-', $+[2], ')';
      if( $1 ){
        say "    but skipped: \"$1\" at location ($-[1]-$+[1])";
      }
    }else{
      say 'Unable to find ', $regex;
      
      # end loop
      last;
    }
  }
}

edited Jun 20, 2020 at 9:12

CommunityBot

11 silver badge

answered May 7, 2009 at 19:47

Brad Gilbert

34.2k11 gold badges80 silver badges131 bronze badges

Sign up to request clarification or add additional context in comments.

3 Comments

Omnipresent Over a year ago

that is awesome and much less lines of code! gah I'm starting to love perl!

Brad Gilbert Over a year ago

It is also easy to add more regexes, if needed.

Omnipresent Over a year ago

it would help me (and maybe to OP) if you added some comments explaining what is going on. it looks great but since atleast I'm a n00b to perl I dont get some of the lines..e.g. use feature qw'say';

Omnipresent · Accepted Answer · 2009-05-07 16:54:34Z

I'm not a perl expert but you can use $- and $+ to track index location for last regex match found.
Below is code built on top of your code that explains this.

use strict;
use warnings;


my $file="sample.txt";
open(DAT, $file) || die("Could not open file!");

open (OUTPUTFILE, '>data.txt');

my $sequence;
my $someVar = 0;
my $sequenceNums = 1;

my $motif1 = "(HDWFLSFKD)";
my $motif2 = "(HD)";
my $motif3 = "(K)";
my $motif4 = "(DD)";

while (my $line = <DAT>) 
{
    $someVar = 0;
    print "\nSequence $sequenceNums: $line\n";
    print OUTPUTFILE "\nSequence $sequenceNums: $line\n";
        if ($line=~ /$motif1/g)
        {
                &printStuff($sequenceNums, "motif1", $motif1, "$-[0]-$+[0]");
                $someVar = 1;
        }


        if ($line=~ /$motif2/g and $someVar == 1)
        {
                &printStuff($sequenceNums, "motif2", $motif2, "$-[0]-$+[0]");
                $someVar = 2;
        }

        if ($line=~ /$motif3/g and $someVar == 2)
        {
                &printStuff($sequenceNums, "motif3", $motif4, "$-[0]-$+[0]");
                $someVar = 3;
        }

        if ($line=~ /$motif4/g and $someVar == 3)
        {
                &printStuff($sequenceNums, "motif4", $motif4, "$-[0]-$+[0]");
        }

        else 
        {
            $sequence .= $line;

            if ($someVar == 0)
            {
                &printWrongStuff($sequenceNums, "motif1", $motif1);
            }
            elsif ($someVar == 1)
            {
            &printWrongStuff($sequenceNums, "motif2", $motif2);
            }
            elsif ($someVar == 2)
            {
            &printWrongStuff($sequenceNums, "motif3", $motif3);
            }
            elsif ($someVar == 3)
            {
            &printWrongStuff($sequenceNums, "motif4", $motif4);
            }
        }
        $sequenceNums++;
}

sub printStuff
{
            print "Sequence: $_[0] $_[1]: $_[2] index location: $_[3] \n";
            print OUTPUTFILE "Sequence: $_[0]  $_[1]: $_[2] index location: $_[3]\n";
}

sub printWrongStuff
{
            print "Sequence: $_[0] $_[1]: $_[2] was not found\n";
            print OUTPUTFILE "Sequence: $_[0] $_[1]: $_[2] was not found\n";    

}

close (OUTPUTFILE);
close (DAT);

Sample input:

MLTSHQKKFHDWFLSFKDSNNYNHDSKQNHSIKDDIFNRFNHYIYNDLGIRTIA MLTSHQKKFSNNYNSKQNHSIKDIFNRFNHYIYNDLGIRTIA MLTSHQKKFSNNYNSKHDWFLSFKDQNHSIKDIFNRFNHYIYNDL

Brad Gilbert · Accepted Answer · 2009-05-07 21:18:27Z

1

You really should read

You need the special variables @- and @+ if you need the positions. No need to try to compute them yourself.

#!/usr/bin/perl

use strict;
use warnings;

use List::MoreUtils qw( each_array );

my $source = 'AAAA   BBCCC   DD  E      FFFFF';
my $pattern = join '\s*', map { "($_+)" } qw( A B C D E F );



if ( $source =~ /$pattern/ ) {
    my $it = each_array @-, @+;

    $it->(); # discard overall match information;

    while ( my ($start, $end) = $it->() ) {
        printf "Start: %d - Length: %d\n", $start, $end - $start;
    }
}

Start: 0 - Length: 4
Start: 7 - Length: 2
Start: 9 - Length: 3
Start: 15 - Length: 2
Start: 19 - Length: 1
Start: 26 - Length: 5

edited May 7, 2009 at 21:18

Brad Gilbert

34.2k11 gold badges80 silver badges131 bronze badges

answered May 7, 2009 at 13:00

Sinan Ünür

118k15 gold badges201 silver badges347 bronze badges

Comments

Fritz G. Mehner · Accepted Answer · 2009-05-07 13:01:36Z

0

The result of a construct like

$line=~ /(HD)/g

is a list. Use while to step through the hits.

answered May 7, 2009 at 13:01

Fritz G. Mehner

17.2k3 gold badges36 silver badges43 bronze badges

1 Comment

Sinan Ünür Over a year ago

The OP needs the start positions as well as the lengths.

Anonymous · Accepted Answer · 2009-05-07 13:13:26Z

0

To match where the last match left off, use \G. perldoc perlre says (but consult your own installation's version's manual first):

The "\G" assertion can be used to chain global matches (using "m//g"), as described in "Regexp Quote-Like Operators" in perlop. It is also useful when writing "lex"-like scanners, when you have several patterns that you want to match against consequent substrings of your string, see the previous reference. The actual location where "\G" will match can also be influenced by using "pos()" as an lvalue: see "pos" in perlfunc. Note that the rule for zero-length matches is modified somewhat, in that contents to the left of "\G" is not counted when determining the length of the match. Thus the following will not match forever:
$str = 'ABC';
pos($str) = 1;
while (/.\G/g) {
    print $&;
}

answered May 7, 2009 at 13:13

Anonymous

50.6k1 gold badge27 silver badges19 bronze badges

2 Comments

Sinan Ünür Over a year ago

Why the gratuitous use of $& ? See perldoc perlreref: "The use of $`, $& or $' will slow down all regex use within your program. Consult perlvar for "@-" to see equivalent expressions that won't cause slow down."

Anonymous Over a year ago

Probably to make an illustrative point. Take it up with the perlre author.

Collectives™ on Stack Overflow

Find multiple substrings in strings and record location

5 Answers 5

Try this perl program

3 Comments

Comments

Comments

1 Comment

2 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

5 Answers 5

Try this perl program

3 Comments

Comments

Comments

1 Comment

2 Comments

Your Answer

Sign up or log in

Post as a guest

Related