1

I have a text file (seq.fasta) which contains a sequence as follows:

>M1

MPMILGYWNVRGLTHPIRMLLEYTDSSYDEKRYTMGDAPDFDRSQWLNEKFKLGLDFPNL
PYLIDGSHKITQSNAILRYLARKHHLDGETEEERIRADIVENQVMDTRMQLIMLCYNPDF
EKQKPEFLKTIPEKMKLYSEFLGKRPWFAGDKVTYVDFLAYDILDQYRMFEPKCLDAFPN
LRDFLARFEGLKKISAYMKSSRYIATPIFSKMAHWSNK

I have to extract the motif PXXP, which is exactly 4 characters long (XX can be any two characters).

I tried the following code:

import re
infile=open("seq.fasta",'r')
out=open("out.csv",'w')
for line in infile:
   line = line.strip("\n")
   if line.startswith('>'):
      name=line
   else:
      motif = re.compile(r"(\bP{2}P\b)")
      c = line.count('motif')
      print '%s:%s' %(name,c)
      out.write('%s:%s\n' %(name,c))

But it does not find the motif.

2
  • There is no string P..P in the input provided above (here . stands for "any character"). I don't get the question. Please update it with the expected output. Your regexp says to look for a word boundary, followed by two Ps, followed by a P, and then a word boundary. Commented Sep 8, 2011 at 9:52
  • @Fredrik The P..P string appears split across the first two lines. Presumably, the entire string is intended to represent a line from the file. Commented Sep 8, 2011 at 11:06

2 Answers

5

Try with this one:

 re.compile(r"(P..P)")

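. matches any single character (except a newline, by default).

{2} means that the preceding token must be repeated exactly twice; in your regex, P{2} means PP, so the whole pattern looks for PPP.

\b matches a word boundary. There are no word boundaries between the letters of a continuous sequence, so \bP{2}P\b can only match PPP standing alone as a whole word.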
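Two further points are worth noting about the code in the question: line.count('motif') counts the literal text "motif" rather than using the compiled pattern, and a motif can straddle a line break in the file (as the comments point out). Below is a minimal sketch in Python 3 that joins the lines of each record before searching. The function name count_motifs and the shortened two-line example record are illustrative assumptions, not part of the original code.

```python
import re

# P, any two characters, P. No \b anchors: residues inside a sequence
# are not separated by word boundaries.
MOTIF = re.compile(r"P..P")


def count_motifs(lines):
    """Return {record name: number of non-overlapping PXXP matches}.

    `lines` is any iterable of FASTA lines, e.g. an open file object.
    """
    counts = {}
    name, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                counts[name] = len(MOTIF.findall("".join(chunks)))
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    # Close the last record.
    if name is not None:
        counts[name] = len(MOTIF.findall("".join(chunks)))
    return counts


# Hypothetical record: the first two sequence lines from the question.
example = [
    ">M1",
    "MPMILGYWNVRGLTHPIRMLLEYTDSSYDEKRYTMGDAPDFDRSQWLNEKFKLGLDFPNL",
    "PYLIDGSHKITQSNAILRYLARKHHLDGETEEERIRADIVENQVMDTRMQLIMLCYNPDF",
]
for record, count in count_motifs(example).items():
    print("%s:%s" % (record, count))
```

To run it on the real file, pass the open file instead of the list, for example count_motifs(open("seq.fasta")). Note that findall only reports non-overlapping matches.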


3

You can use this:

re.compile( r"(P[\w]{2}P)" )

or

re.compile( r"(P[A-Z]{2}P)" )

The metacharacter \w matches any alphanumeric character or underscore; for ASCII text it is equivalent to [a-zA-Z0-9_], so it is looser than [A-Z].
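To illustrate the difference between the two patterns, here is a small Python 3 sketch. It also shows one possible way to report overlapping hits with a lookahead, in case those matter for your data. The names STRICT, LOOSE, OVERLAPPING, motif_hits and overlapping_hits, and the test strings, are made up for this example.

```python
import re

STRICT = re.compile(r"P[A-Z]{2}P")   # only uppercase letters between the Ps
LOOSE = re.compile(r"P\w{2}P")       # also accepts digits, lowercase, "_"
# Zero-width lookahead: the regex engine advances one character at a time,
# so hits that share a P are all reported.
OVERLAPPING = re.compile(r"(?=(P[A-Z]{2}P))")


def motif_hits(sequence, pattern=STRICT):
    """Return (0-based start, matched text) for each non-overlapping hit."""
    return [(m.start(), m.group()) for m in pattern.finditer(sequence)]


def overlapping_hits(sequence):
    """Return (0-based start, matched text) for every hit, overlaps included."""
    return [(m.start(), m.group(1)) for m in OVERLAPPING.finditer(sequence)]


print(motif_hits("DFPNLPYL"))        # [(2, 'PNLP')]
print(motif_hits("AP1_PA"))          # []
print(motif_hits("AP1_PA", LOOSE))   # [(1, 'P1_P')]
print(overlapping_hits("PAAPAAP"))   # [(0, 'PAAP'), (3, 'PAAP')]
```

For protein sequences, [A-Z] is probably the safer choice, since \w would also accept characters that are not residue codes.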

