7

I am looking for occurrences of "CCGTCAATTC(A|C)TTT(A|G)AGT" in a text file.

$text = 'CCGTCAATTC(A|C)TTT(A|G)AGT';
if ($line =~ /$text/) {
    chomp($line);
    $pos = index($line, $text);
}

The search works, but I am not able to get the position of $text in the line. It seems index does not accept a regular expression as the substring.

How can I make this work? Thanks.

4 Answers

14

The @- array holds the offsets of the starting positions of the last successful match. The first element is the offset of the whole matching pattern, and subsequent elements are offsets of parenthesized subpatterns. So, if you know there was a match, you can get its offset as $-[0].
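As a minimal sketch of the above (the sample line is made up, and the pattern is written with character classes instead of alternation), $-[0] gives the start of the match and $+[0] the offset just past its end:

```perl
use strict;
use warnings;

# Hypothetical sample line; the motif begins after four leading G's.
my $line  = 'GGGGCCGTCAATTCATTTGAGTACGTACGT';
my $motif = qr/CCGTCAATTC[AC]TTT[AG]AGT/;

my ($start, $end);
if ($line =~ $motif) {
    $start = $-[0];   # offset where the whole match begins
    $end   = $+[0];   # offset just past the end of the match
    printf "match at %d..%d: %s\n",
        $start, $end, substr($line, $start, $end - $start);
}
```

Both arrays are reset by every successful match, so read them right after the match you care about.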


3

You don't need index at all, just a regex. The portion of $line that comes before the regex match is stored in $` (or $PREMATCH under use English;). The length of $` is the index of the match, and the match itself is available in $& (or $MATCH):

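use English;   # provides $PREMATCH, $MATCH and $POSTMATCH

$text = 'CCGTCAATTC(A|C)TTT(A|G)AGT';
if ($line =~ /$text/) {
    $pos = length($PREMATCH);   # length of the text before the match = match offset
}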

Assuming you want to get $pos to continue matching on the remaining part of $line, you can use the $' (or $POSTMATCH) variable to get the portion of $line that comes after the match.

See http://perldoc.perl.org/perlvar.html for detailed information on these special variables.
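For what it's worth, here is a rough sketch of the same idea using the /p modifier and ${^PREMATCH} (available since Perl 5.10), which avoids the speed penalty that $`, $& and $' could impose on older perls. The sample line is invented, and the 50-character window follows the substr call mentioned in the comments below:

```perl
use strict;
use warnings;

# Hypothetical sample line: 4 leading bases, the motif, then 40 more bases.
my $line = 'GGGG' . 'CCGTCAATTCATTTGAGT' . ('ACGT' x 10);

my ($pos, $window);
if ($line =~ /CCGTCAATTC[AC]TTT[AG]AGT/p) {
    $pos    = length(${^PREMATCH});     # same value as $-[0]
    $window = substr($line, $pos, 50);  # the match plus what follows, 50 chars in total
    print "pos=$pos window=$window\n";
}
```

Because the window starts at the match offset rather than after the match, the matched text itself is kept.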

8 Comments

Yes, I can do that. But once I capture the position, I then need to capture the next 50 chars: substr($line,$pos,50)
You can match on the remaining part of $line the way you said -- is that approach undesirable for some reason? You could also use the $' (or $POSTMATCH) variable to easily get the remaining part of $line.
Please see my amended answer; let me know if you're looking for something else.
Yes, you are correct. The only thing is that this way I will lose the text. I mean text + x = 50 words.
Your approach worked, but as I thought, I am missing the matching string.
1

Based on your comments, it seems like what you are after is matching the 50 characters directly following the match. So, a simple solution would be:

my ($match) = $line =~ /CCGTCAATTC[AC]TTT[AG]AGT(.{50})/;

As you see, the character class [AG] matches the same text as (A|G), but without creating a capture group. If you wish to match multiple times, you can assign to an array such as @matches and add the /g (global) option to the regex. E.g.

my @matches = $line =~ /CCGTCAATTC[AC]TTT[AG]AGT(.{50})/g;

You can do this to keep the matching pattern:

my ($pattern, $match) = $line =~ /(CCGTCAATTC[AC]TTT[AG]AGT)(.{50})/;

Or in a loop:

while ($line =~ /(CCGTCAATTC[AC]TTT[AG]AGT)(.{50})/g) {
    my ($pattern, $match) = ($1, $2);
}
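A small sketch of the loop form that also records where each hit starts, in case the position is still needed. The sample line, the @hits array and the 32-character tail (18-character pattern + 32 = 50) are assumptions for illustration, not something from the question:

```perl
use strict;
use warnings;

# Hypothetical line containing two occurrences of the pattern.
my $line = 'AA' . 'CCGTCAATTCATTTGAGT' . ('T' x 40)
         . 'CCGTCAATTCCTTTAAGT' . ('G' x 40);

my @hits;
while ($line =~ /(CCGTCAATTC[AC]TTT[AG]AGT)(.{32})/g) {
    # $-[0] is the start offset of the current match within $line
    push @hits, { pos => $-[0], motif => $1, tail => $2 };
}

printf "%d: %s%s\n", $_->{pos}, $_->{motif}, $_->{tail} for @hits;
```

Note that .{32} requires at least 32 characters after the pattern, so an occurrence too close to the end of the line is skipped; .{0,32} would relax that.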

4 Comments

Actually, I need the matching chars too, so maybe I have to cheat and put 38 instead of 50.
Wouldn't your question have been a whole lot simpler to answer if you'd said from the start what you wanted? =) Well, assuming you do know how many characters you want to capture, I think you can work out how to fix it.
This also gave me another idea, so thanks, TLP.
Yes, I could have been a bit clearer about the next few steps I was trying to do... I will be more careful next time.
0
while ($line =~ /(CCGTCAATTC[AC]TTT[AG]AGT)(.{50})/g;) {

I like it, but there must be no ; inside the while condition. The stray semicolon in the line quoted above is a syntax error.

I had a hard time tracking down the cause of the errors. T_T
