I have a long file with about 1200 sequences:

>3fm8|A|A0JLQ2
CFLVNLNADPALNELLVYYLKEHTLIGSANSQDIQLCGMGILPEHCIIDITSEGQVMLTP
QKNTRTFVNGSSVSSPIQLHHGDRILWGNNHFFRLNLP


>2ht9|A|A0JLT0
LATAPVNQIQETISDNCVVIFSKTSCSYCTMAKKLFHDMNVNYKVVELDLLEYGNQFQDA
LYKMTGERTVPRIFVNGTFIGGATDTHRLHKEGKLLPLVHQCYL

I want to extract every pattern that has a cysteine (C) in the middle, with five characters before it and five characters after it, such as xxxxxCxxxxx.

The output should look like this:

  • QDIQLCGMGIL
  • ILPEHCIIDIT
  • TISDNCVVIFS
  • FSKTSCSYCTM

This is my program. It only gives the position of C, which is not what I want:

pos=[]

def find(ch,string1):

    for i in range(len(string1)):
        if ch == string1[i]:
            pos.append(i)
            return pos



z=find('C','AWERQRTCWERTYCTAAAACTTCTTT')

print z
  • Actually, I have a use case here that needs guidance. Why not TSCSYCTMAKK instead of FSKTSCSYCTM, or would that matter? Commented Apr 13, 2016 at 22:53
  • I think I solved the problem by identifying all cases. Commented Apr 13, 2016 at 23:21

2 Answers

You need to return outside the loop. You are returning on the first match, so you only ever get a single index in your list:

def find(ch,string1):  
    pos = []
    for i in range(len(string1)):
        if ch == string1[i]:
            pos.append(i)
    return pos # outside

You can also use enumerate with a list comp in place of your range logic:

def indexes(ch, s1):
    # keep only indexes with at least five characters on each side
    return [index for index, char in enumerate(s1) if char == ch and 5 <= index <= len(s1) - 6]

In the list comp, each index is the character's position and each char is the character itself, so we keep every index where char equals ch.

If you want the five chars on both sides:

In [24]: s="CFLVNLNADPALNELLVYYLKEHTLIGSANSQDIQLCGMGILPEHCIIDITSEGQVMLTP QKNTRTFVNGSSVSSPIQLHHGDRILWGNNHFFRLNLP"

In [25]: inds = indexes("C",s)

In [26]: [s[i-5:i+6] for i in inds]
Out[26]: ['QDIQLCGMGIL', 'ILPEHCIIDIT']

I added a check on the index because we obviously cannot get five chars before C if the index is < 5, and likewise cannot get five chars after it if C is within five chars of the end.
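To see why the bounds check matters, here is a small sketch that slices without it (the helper name window_unchecked is made up for illustration). A negative start index wraps around to the end of the string, so a C too close to the start gives an empty string, and a C too close to the end gives a truncated window:

```python
def window_unchecked(s, i, flank=5):
    # No bounds check: i - flank may go negative, i + flank + 1 may pass the end.
    return s[i - flank:i + flank + 1]

s = "CFLVNLNADPALNELLVYYLKEHTLIGSANSQDIQLCGMGILPEHC"

print(repr(window_unchecked(s, 0)))   # ''            (C at the very start)
print(repr(window_unchecked(s, 36)))  # 'QDIQLCGMGIL' (enough room on both sides)
print(repr(window_unchecked(s, 45)))  # 'ILPEHC'      (C at the very end)
```

Only the middle case is a valid 11-character window, which is what the 5 <= index <= len(s) - 6 test guarantees.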

You can do it all in a single function, yielding a slice when you find a match:

def find(ch, s):
    ln = len(s)
    for i, char in enumerate(s):
        if ch == char and 5 <= i <= ln - 6:
            yield s[i - 5:i + 6]

This presumes the data in your question is actually two lines from your file, with each header and its sequence on a single line, like:

s = """>3fm8|A|A0JLQ2CFLVNLNADPALNELLVYYLKEHTLIGSANSQDIQLCGMGILPEHCIIDITSEGQVMLTPQKNTRTFVNGSSVSSPIQLHHGDRILWGNNHFFRLNLP
>2ht9|A|A0JLT0LATAPVNQIQETISDNCVVIFSKTSCSYCTMAKKLFHDMNVNYKVVELDLLEYGNQFQDALYKMTGERTVPRIFVNGTFIGGATDTHRLHKEGKLLPLVHQCY"""

Running:

for line in s.splitlines():
    print(list(find("C" ,line)))

would output:

['0JLQ2CFLVNL', 'QDIQLCGMGIL', 'ILPEHCIIDIT']
['TISDNCVVIFS', 'FSKTSCSYCTM', 'TSCSYCTMAKK']

That gives six matches, not the four in your expected output, so I presume you did not list every possible match.
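Note that the first match in that output, '0JLQ2CFLVNL', contains characters from the header rather than sequence residues. If the file is actually FASTA-style as shown in the question (a header line starting with >, then the sequence wrapped over several lines), one option is to join each record's sequence lines before searching. A minimal sketch under that assumption; read_fasta and windows are hypothetical helper names, not part of any library:

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs, joining wrapped sequence lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def windows(seq, ch="C", flank=5):
    """Yield every window with ch in the middle and flank chars on each side."""
    last = len(seq) - flank - 1
    for i, char in enumerate(seq):
        if char == ch and flank <= i <= last:
            yield seq[i - flank:i + flank + 1]


sample = """>3fm8|A|A0JLQ2
CFLVNLNADPALNELLVYYLKEHTLIGSANSQDIQLCGMGILPEHCIIDITSEGQVMLTP
QKNTRTFVNGSSVSSPIQLHHGDRILWGNNHFFRLNLP

>2ht9|A|A0JLT0
LATAPVNQIQETISDNCVVIFSKTSCSYCTMAKKLFHDMNVNYKVVELDLLEYGNQFQDA
LYKMTGERTVPRIFVNGTFIGGATDTHRLHKEGKLLPLVHQCYL
"""

for header, seq in read_fasta(sample.splitlines()):
    print(header, list(windows(seq)))
```

For a real file you could pass the open file object instead of sample.splitlines(), since both are iterables of lines. With the headers kept out of the sequence, the sample gives five windows rather than six.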

You can also speed up the code using str.find, starting each subsequent search at the last match index + 1:

def find(ch, s):
    ln = len(s) - 6
    # start at index 5: an earlier match has no room for five leading chars
    i = s.find(ch, 5)
    while 5 <= i <= ln:
        yield s[i - 5:i + 6]
        i = s.find(ch, i + 1)

This gives the same output. Of course, if the matches cannot overlap, you can resume the search much further along the string after each match.
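As a rough sketch of that last point: if windows must not share any characters, the next C that can anchor a window lies at least 11 positions after the current one, so the search can resume there. The function name find_non_overlapping and the flank parameter are illustrative assumptions:

```python
def find_non_overlapping(ch, s, flank=5):
    last = len(s) - flank - 1
    # Start at index flank: an earlier match has no room for the leading chars.
    i = s.find(ch, flank)
    # str.find returns -1 when nothing is found, which also ends the loop.
    while flank <= i <= last:
        yield s[i - flank:i + flank + 1]
        # Skip past the whole window plus the next window's leading flank.
        i = s.find(ch, i + 2 * flank + 1)


print(list(find_non_overlapping("C", "TISDNCVVIFSKTSCSYCTMAKK")))
# ['TISDNCVVIFS', 'TSCSYCTMAKK']
```

Note that this drops FSKTSCSYCTM, which overlaps the first window, so it only fits if overlapping matches really are unwanted.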

5 Comments

I noticed the output he provided skips the "C's" in the middle for some cases (like the one I commented). I was wondering if that's intentional? (Especially since your solution shows multiple cases)
@Adib, I am rereading the question myself, some of it actually does not add up
Thank you for the help, but this will only give the position of each C
@samooo, s[index-5:index+6] but not sure about all of your expected output
@samooo can you explain your output? Like, which outputs matter?

My solution is based on regex: it finds all possible matches, including overlapping ones, by running a regex search inside a while loop. Thanks to @Smac89 for improving it by transforming it into a generator:

import re

string = """CFLVNLNADPALNELLVYYLKEHTLIGSANSQDIQLCGMGILPEHCIIDITSEGQVMLTPQKNTRTFVNGSSVSSPIQLHHGDRILWGNNHFFRLNLP

LATAPVNQIQETISDNCVVIFSKTSCSYCTMAKKLFHDMNVNYKVVELDLLEYGNQFQDA LYKMTGERTVPRIFVNGTFIGGATDTHRLHKEGKLLPLVHQCYL"""

# Generator
def find_cysteine2(string):

    # Create a loop that will utilize regex multiple times
    # in order to capture matches within groups
    while True:
        # Find a match
        data = re.search(r'(\w{5}C\w{5})',string)

        # If match exists, let's collect the data
        if data:
            # Collect the string
            yield data.group(1)

            # Shrink the string to not include 
            # the previous result
            location = data.start() + 1
            string = string[location:]

        # If there are no matches, stop the loop
        else:
            break

print(list(find_cysteine2(string)))
# ['QDIQLCGMGIL', 'ILPEHCIIDIT', 'TISDNCVVIFS', 'FSKTSCSYCTM', 'TSCSYCTMAKK']
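An alternative sketch, not from the answer above: a zero-width lookahead lets re.findall report overlapping matches in a single pass, without re-slicing the string after each hit. It keeps the same \w{5}C\w{5} pattern; the function name find_cysteine_lookahead is made up for illustration:

```python
import re


def find_cysteine_lookahead(string):
    # The lookahead consumes no characters, so the scan advances one
    # position at a time and every overlapping window is reported.
    # With a single capture group, findall returns the captured text.
    return re.findall(r'(?=(\w{5}C\w{5}))', string)


print(find_cysteine_lookahead("TISDNCVVIFSKTSCSYCTMAKK"))
# ['TISDNCVVIFS', 'FSKTSCSYCTM', 'TSCSYCTMAKK']
```

Because \w does not match spaces or newlines, a window never spans two lines of a multi-line string, just as in the while-loop version.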

5 Comments

I would make this function into a generator in order to reduce the memory overhead for large input
@samooo Yup! And it doesn't care if it is one line, two line, 1200 lines, it'll find it fast :D
@Smac89 I'm not that proficient with generators yet. Could you provide a code example?
It's simple with your current code. Replace output.append(data.group(1)) with yield data.group(1), next remove all instances of "output" from the function (including the return statement). Finally to call the generator, you do print list(find_cysteine(string))
@Smac89 Wow...the efficiency possible with this D: Thank you so much!
