extract substring pattern

Question

I have a long file with about 1200 sequences, like these:

>3fm8|A|A0JLQ2
CFLVNLNADPALNELLVYYLKEHTLIGSANSQDIQLCGMGILPEHCIIDITSEGQVMLTP
QKNTRTFVNGSSVSSPIQLHHGDRILWGNNHFFRLNLP


>2ht9|A|A0JLT0
LATAPVNQIQETISDNCVVIFSKTSCSYCTMAKKLFHDMNVNYKVVELDLLEYGNQFQDA
LYKMTGERTVPRIFVNGTFIGGATDTHRLHKEGKLLPLVHQCYL

I want to extract every pattern that has a cysteine (C) in the middle, with five characters before it and five characters after it, such as xxxxxCxxxxx.

The output should look like this:

QDIQLCGMGIL
ILPEHCIIDIT
TISDNCVVIFS
FSKTSCSYCTM

This is my program. It only gives the position of C, so it does not do what I want:

pos=[]

def find(ch,string1):

    for i in range(len(string1)):
        if ch == string1[i]:
            pos.append(i)
            return pos



z=find('C','AWERQRTCWERTYCTAAAACTTCTTT')

print z

Actually, there is a case here that needs clarifying. Why not TSCSYCTMAKK instead of FSKTSCSYCTM, or would that not matter?
– Adib, Commented Apr 13, 2016 at 22:53

Padraic Cunningham · Accepted Answer · 2016-04-14 00:12:06Z

2

You need to return outside the loop. You are returning on the first match, so you only ever get a single index in your list:

def find(ch,string1):  
    pos = []
    for i in range(len(string1)):
        if ch == string1[i]:
            pos.append(i)
    return pos # outside

You can also use enumerate with a list comp in place of your range logic:

def indexes(ch, s1):
    # keep only indexes that have at least five characters on each side
    return [index for index, char in enumerate(s1) if char == ch and 5 <= index <= len(s1) - 6]

Each index in the list comp is the character index and each char is the actual character so we keep each index where char is equal to ch.

If you want the five characters on both sides:

In [24]: s="CFLVNLNADPALNELLVYYLKEHTLIGSANSQDIQLCGMGILPEHCIIDITSEGQVMLTP QKNTRTFVNGSSVSSPIQLHHGDRILWGNNHFFRLNLP"

In [25]: inds = indexes("C",s)

In [26]: [s[i-5:i+6] for i in inds]
Out[26]: ['QDIQLCGMGIL', 'ILPEHCIIDIT']

I added the index check because we obviously cannot get five characters before C if the index is less than 5, and the same applies to the five characters after it near the end of the string.
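To see why the check matters, here is a small sketch. The sample string and the helper name `window` are made up for illustration. A negative start index in a Python slice counts from the end of the string, so an unchecked `s[i-5:i+6]` near the start does not raise an error; it silently returns the wrong text:

```python
# Hypothetical sample: the only C is at index 2, too close to the start.
s = "ABCDEFGHIJKLMNOP"

def window(s, i, flank=5):
    """Return the slice of flank + 1 + flank characters centred on i, or None if it does not fit."""
    if flank <= i <= len(s) - flank - 1:
        return s[i - flank:i + flank + 1]
    return None

i = s.find("C")
unchecked = s[i - 5:i + 6]   # s[-3:8]: the start wraps around, giving an empty string
checked = window(s, i)       # None, because five characters do not fit before index 2
print(repr(unchecked), checked)
```
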

You can do it all in a single function, yielding a slice when you find a match:

def find(ch, s):
    ln = len(s)
    for i, char in enumerate(s):
        if ch == char and 5 <= i <= ln - 6:
            yield s[i- 5:i + 6]

Presuming the data in your question is actually two lines from your file, like:

s=""">3fm8|A|A0JLQ2CFLVNLNADPALNELLVYYLKEHTLIGSANSQDIQLCGMGILPEHCIIDITSEGQVMLTPQKNTRTFVNGSSVSSPIQLHHGDRILWGNNHFFRLNLP
>2ht9|A|A0JLT0LATAPVNQIQETISDNCVVIFSKTSCSYCTMAKKLFHDMNVNYKVVELDLLEYGNQFQDALYKMTGERTVPRIFVNGTFIGGATDTHRLHKEGKLLPLVHQCY"""

Running:

for line in s.splitlines():
    print(list(find("C" ,line)))

would output:

['0JLQ2CFLVNL', 'QDIQLCGMGIL', 'ILPEHCIIDIT']
['TISDNCVVIFS', 'FSKTSCSYCTM', 'TSCSYCTMAKK']

Which gives six matches, not the four that your expected output suggests, so I presume you did not include all possible matches.
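Note that the first window, '0JLQ2CFLVNL', contains characters from the header. If the file is instead laid out as in the question, with a `>` header line followed by a sequence wrapped over several lines, one option is to join each record's sequence lines before searching. The following is only a sketch under that assumption; `read_fasta` and `windows` are hypothetical helper names, and `io.StringIO` stands in for the real file:

```python
import io

def read_fasta(handle):
    """Yield (header, sequence) pairs, joining wrapped sequence lines."""
    header, parts = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue                      # skip blank lines between records
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(parts)
            header, parts = line[1:], []
        else:
            parts.append(line)
    if header is not None:                # emit the last record
        yield header, "".join(parts)

def windows(seq, ch="C", flank=5):
    """Yield every window of flank + 1 + flank characters centred on ch."""
    i = seq.find(ch, flank)
    while flank <= i <= len(seq) - flank - 1:   # also stops when find returns -1
        yield seq[i - flank:i + flank + 1]
        i = seq.find(ch, i + 1)

# The two records from the question, wrapped as shown there.
demo = io.StringIO(""">3fm8|A|A0JLQ2
CFLVNLNADPALNELLVYYLKEHTLIGSANSQDIQLCGMGILPEHCIIDITSEGQVMLTP
QKNTRTFVNGSSVSSPIQLHHGDRILWGNNHFFRLNLP

>2ht9|A|A0JLT0
LATAPVNQIQETISDNCVVIFSKTSCSYCTMAKKLFHDMNVNYKVVELDLLEYGNQFQDA
LYKMTGERTVPRIFVNGTFIGGATDTHRLHKEGKLLPLVHQCYL
""")

results = {header: list(windows(seq)) for header, seq in read_fasta(demo)}
for header, hits in results.items():
    print(header, hits)
```

With the headers kept out of the search, this gives five windows for the two records: the four from the question plus 'TSCSYCTMAKK'. For a real file, `open(path)` would replace the `StringIO` object.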

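You can also speed up the code using str.find, starting each subsequent search at the last match index + 1. The first search starts at index 5 so that a C too close to the start of the string does not end the loop early:

def find(ch, s):
    ln, i = len(s) - 6, s.find(ch, 5)
    while 5 <= i <= ln:  # find returns -1 when there are no more matches
        yield s[i - 5:i + 6]
        i = s.find(ch, i + 1)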

Which will give the same output. Of course if the strings cannot overlap you can start looking for the next match much further in the string each time.
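As a rough sketch of that non-overlapping case: a window centred on index i ends at i + 5, so the next window can start at i + 6 at the earliest, which puts its centre at i + 11 or later. The function name `find_non_overlapping` and the `flank` parameter are my own, and taking windows greedily from the left is only one possible reading of "cannot overlap":

```python
def find_non_overlapping(ch, s, flank=5):
    """Yield windows centred on ch that do not share any character."""
    last = len(s) - flank - 1
    i = s.find(ch, flank)
    while flank <= i <= last:             # find returns -1 when nothing is left
        yield s[i - flank:i + flank + 1]
        # The window just yielded ends at i + flank, so the next centre
        # must be at least 2 * flank + 1 positions further on.
        i = s.find(ch, i + 2 * flank + 1)

line = "LATAPVNQIQETISDNCVVIFSKTSCSYCTMAKKLFHDMNVNYKVVELDLLEYGNQFQDA"
hits = list(find_non_overlapping("C", line))
print(hits)  # ['TISDNCVVIFS', 'TSCSYCTMAKK']
```

Here 'FSKTSCSYCTM' is dropped because it shares characters with 'TISDNCVVIFS'.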

edited Apr 14, 2016 at 0:12

answered Apr 13, 2016 at 22:41

Padraic Cunningham


5 Comments

Adib Over a year ago

I noticed the output he provided skips the "C's" in the middle for some cases (like the one I commented). I was wondering if that's intentional? (Especially since your solution shows multiple cases)

Padraic Cunningham Over a year ago

@Adib, I am rereading the question myself, some of it actually does not add up

samooo Over a year ago

Thank you for the help, but this will only give the position of each C.

Padraic Cunningham Over a year ago

@samooo, s[index-5:index+6] but not sure about all of your expected output

Adib Over a year ago

@samooo can you explain your output? Like, which outputs matter?

Adib · Accepted Answer · 2016-04-14 00:09:30Z

1

My solution is based on regex and finds all possible matches by running a regex search repeatedly in a while loop. Thanks to @Smac89 for improving it by turning it into a generator:

import re

string = """CFLVNLNADPALNELLVYYLKEHTLIGSANSQDIQLCGMGILPEHCIIDITSEGQVMLTPQKNTRTFVNGSSVSSPIQLHHGDRILWGNNHFFRLNLP

LATAPVNQIQETISDNCVVIFSKTSCSYCTMAKKLFHDMNVNYKVVELDLLEYGNQFQDA LYKMTGERTVPRIFVNGTFIGGATDTHRLHKEGKLLPLVHQCYL"""

# Generator
def find_cysteine2(string):

    # Create a loop that will utilize regex multiple times
    # in order to capture matches within groups
    while True:
        # Find a match
        data = re.search(r'(\w{5}C\w{5})',string)

        # If match exists, let's collect the data
        if data:
            # Collect the string
            yield data.group(1)

            # Shrink the string to not include 
            # the previous result
            location = data.start() + 1
            string = string[location:]

        # If there are no matches, stop the loop
        else:
            break

print(list(find_cysteine2(string)))
# ['QDIQLCGMGIL', 'ILPEHCIIDIT', 'TISDNCVVIFS', 'FSKTSCSYCTM', 'TSCSYCTMAKK']
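A possible alternative that avoids re-slicing the string on every iteration is a zero-width lookahead with `re.finditer`. This is a sketch rather than part of the solution above, and the name `find_cysteine_lookahead` is hypothetical. Because the lookahead itself consumes no characters, the scan advances one position at a time and overlapping windows are all reported:

```python
import re

# The capture group sits inside a lookahead, so each match is empty
# and the 11-character window is read from group(1).
PATTERN = re.compile(r'(?=(\w{5}C\w{5}))')

def find_cysteine_lookahead(text):
    """Yield every 11-character window with C in the centre, overlaps included."""
    for match in PATTERN.finditer(text):
        yield match.group(1)

sample = "LATAPVNQIQETISDNCVVIFSKTSCSYCTMAKKLFHDMNVNYKVVELDLLEYGNQFQDA"
hits = list(find_cysteine_lookahead(sample))
print(hits)  # ['TISDNCVVIFS', 'FSKTSCSYCTM', 'TSCSYCTMAKK']
```

As in the original pattern, `\w` also matches digits and underscores, so header text would have to be removed from the input first.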

edited Apr 14, 2016 at 0:09

answered Apr 13, 2016 at 23:01

Adib


5 Comments

smac89 Over a year ago

I would make this function into a generator in order to reduce the memory overhead for large input

Adib Over a year ago

@samooo Yup! And it doesn't care if it is one line, two line, 1200 lines, it'll find it fast :D

Adib Over a year ago

@Smac89 I'm not that proficient with generators yet. Could you provide a code example?

smac89 Over a year ago

It's simple with your current code. Replace output.append(data.group(1)) with yield data.group(1), next remove all instances of "output" from the function (including the return statement). Finally to call the generator, you do print list(find_cysteine(string))

Adib Over a year ago

@Smac89 Wow...the efficiency possible with this D: Thank you so much!
