I have a text file (seq.fasta) which contains a sequence as follows:
M1
MPMILGYWNVRGLTHPIRMLLEYTDSSYDEKRYTMGDAPDFDRSQWLNEKFKLGLDFPNL
PYLIDGSHKITQSNAILRYLARKHHLDGETEEERIRADIVENQVMDTRMQLIMLCYNPDF
EKQKPEFLKTIPEKMKLYSEFLGKRPWFAGDKVTYVDFLAYDILDQYRMFEPKCLDAFPN
LRDFLARFEGLKKISAYMKSSRYIATPIFSKMAHWSNK
I have to extract the motif PXXP, which is exactly 4 characters long (XX can be any two characters).
I tried the following code:
import re
infile=open("seq.fasta",'r')
out=open("out.csv",'w')
for line in infile:
    line = line.strip("\n")
    if line.startswith('>'):
        name=line
    else:
        motif = re.compile(r"(\bP{2}P\b)")
        c = line.count('motif')
        print '%s:%s' %(name,c)
        out.write('%s:%s\n' %(name,c))
But it is not finding the motif.
P..P in the provided input above (here . stands for "any character").
I don't get the question. Please update it with the expected output.
Your regexp says to look for a word boundary, followed by 2 Ps, followed by a P, and then a word boundary.
The P..P string appears split across the first two lines. Presumably, the entire string is intended to represent a line from the file.
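To make the points above concrete, here is a minimal sketch of one way this could be done. It is not the only approach. Two things stand out in the posted code: line.count('motif') counts the literal text "motif" rather than using the compiled pattern, and \b never matches between two letters, so a pattern wrapped in \b cannot match inside a protein sequence. The sketch assumes header lines start with '>' (as the posted code expects), joins the wrapped sequence lines before searching, and uses a lookahead so overlapping motifs are counted. The names read_fasta, find_motifs and SAMPLE are made up for illustration, and it is written for Python 3.

```python
import re

# "." matches any single character; the lookahead (?=...) lets
# overlapping motifs such as "PAAPAAP" be reported separately.
MOTIF = re.compile(r"(?=(P..P))")


def read_fasta(lines):
    """Yield (name, sequence) pairs, joining wrapped sequence lines.

    Assumes each record starts with a header line beginning with '>'.
    """
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def find_motifs(sequence):
    """Return (1-based start, matched text) for every P..P occurrence."""
    return [(m.start() + 1, m.group(1)) for m in MOTIF.finditer(sequence)]


# The sequence from the question, with an assumed '>' header line.
SAMPLE = [
    ">M1",
    "MPMILGYWNVRGLTHPIRMLLEYTDSSYDEKRYTMGDAPDFDRSQWLNEKFKLGLDFPNL",
    "PYLIDGSHKITQSNAILRYLARKHHLDGETEEERIRADIVENQVMDTRMQLIMLCYNPDF",
    "EKQKPEFLKTIPEKMKLYSEFLGKRPWFAGDKVTYVDFLAYDILDQYRMFEPKCLDAFPN",
    "LRDFLARFEGLKKISAYMKSSRYIATPIFSKMAHWSNK",
]

for name, seq in read_fasta(SAMPLE):
    hits = find_motifs(seq)
    print("%s:%d %s" % (name, len(hits), hits))
```

For the sample this reports a single hit, PNLP, which straddles the end of the first sequence line and the start of the second. That is why searching line by line finds nothing. An open file object can be passed to read_fasta in place of SAMPLE, and the counts can be written to out.csv in the same loop.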