OK, so I'll get straight to the point. Here is my code:

def digestfragmentwithenzyme(seqs, enzymes):
    fragment = []
    for seq in seqs:
        for enzyme in enzymes:
            results = []
            prog = re.compile(enzyme[0])
            for dingen in prog.finditer(seq):
                results.append(dingen.start() + enzyme[1])
            results.reverse()
            #result = 0
            for result in results:
                fragment.append(seq[result:])
                seq = seq[:result]
            fragment.append(seq[:result])
    fragment.reverse()
    return fragment

The input for this function is a list of strings (seqs), e.g.:

List = ["AATTCCGGTCGGGGCTCGGGGG","AAAGCAAAATCAAAAAAGCAAAAAATC"]

And the enzymes given as input:

[["TC", 1],["GC",1]]

(Note: multiple enzymes can be given, but most of them are recognition sequences made up of the letters A, T, C and G.)

The function should return a list that, in this example, contains 2 lists:

Outputlist = [["AATT","CCGGT","CGGGG","CT","CGGGGG"],["AAAG","CAAAAT","CAAAAAAG","CAAAAAAT","C"]]

Right now I am having trouble splitting on both enzymes and getting the right output.

A little more information about the function: it searches each string (seq) for a recognition site, in this case TC or GC, and cuts at the offset given as the second element of each enzyme entry. It should do that for every string in the list, with every enzyme.

  • It might help to elaborate what exactly the "right output" is. If your program does not do what you want then it won't help us readers to understand what exactly the relation between the input sequence, the enzyme list and the output list is. It's obvious that it is more than a simple search for substrings. Commented Mar 31, 2017 at 18:37
  • Well for starters prog is a regex and should operate on a string, while seq is a list of strings, so prog.finditer(seq) is an error. You need to work with one input string at a time. Commented Mar 31, 2017 at 18:37
  • @AlexHall Yes, I tried it with for seq in seqs (and changed it in the parameters as well), but it didn't give me the correct output. Commented Mar 31, 2017 at 18:43
  • @Risadinha The right output is also given: it's Outputlist. The function should return this if it is programmed correctly. Commented Mar 31, 2017 at 18:43
  • Well that code was a step closer because it didn't raise an exception and end on the 4th line, so show us that code. Commented Mar 31, 2017 at 18:44

6 Answers

Assuming the idea is to cut at each enzyme's recognition site, at the given index within the site (so for a two-letter site with index 1, the cut falls between the two letters), you don't need regex.

You can do this by finding the occurrences, inserting a split marker at the correct index, and then post-processing the result to actually split.

For example:

def digestfragmentwithenzyme(seqs, enzymes):
    # preprocess enzymes once, then apply to each sequence
    replacements = []
    for enzyme in enzymes:
        replacements.append((enzyme[0], enzyme[0][0:enzyme[1]] + '|' + enzyme[0][enzyme[1]:]))
    result = []
    for seq in seqs:
        for r in replacements:
            seq = seq.replace(r[0], r[1])   # So AATTC becomes AATT|C
        result.append(seq.split('|'))       # So AATT|C becomes AATT, C
    return result

def test():
    seqs = ["AATTCCGGTCGGGGCTCGGGGG","AAAGCAAAATCAAAAAAGCAAAAAATC"]
    enzymes = [["TC", 1],["GC",1]]
    print(digestfragmentwithenzyme(seqs, enzymes))
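
One caveat with the marker approach: str.replace only finds non-overlapping occurrences, so recognition sites that overlap each other can be missed. Below is a minimal sketch, assuming that every overlapping site should be cut (the question does not say). The names cut_positions and digest_overlapping are hypothetical and used only for illustration.

```python
import re


def cut_positions(seq, enzymes):
    """Return sorted cut positions for all sites, including overlapping ones."""
    cuts = set()
    for site, offset in enzymes:
        # A lookahead is zero-width, so it matches at every start position,
        # even where two occurrences of the site overlap.
        for m in re.finditer('(?=' + re.escape(site) + ')', seq):
            cuts.add(m.start() + offset)
    # Cuts at 0 or len(seq) would only produce empty fragments, so skip them.
    return sorted(c for c in cuts if 0 < c < len(seq))


def digest_overlapping(seqs, enzymes):
    result = []
    for seq in seqs:
        bounds = [0] + cut_positions(seq, enzymes) + [len(seq)]
        result.append([seq[a:b] for a, b in zip(bounds, bounds[1:])])
    return result


# "AAA" contains the site "AA" at positions 0 and 1 (overlapping).
print(digest_overlapping(["AAA"], [["AA", 1]]))
```

With replace, "AAA" would be marked only once ("A|AA"), while this sketch cuts at both sites.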
5 Comments

No, the enzymes can be longer than 2 letters, and the index can be greater or less than 2. It can be anywhere from 0 to 5, and the recognition site has no minimum or maximum length.
So, for enzyme ['AAT', 2], 'AATACCG' becomes 'AA', 'TACCG', but for ['AAT', 1] it would be 'A', 'AATCCG'?
Yes, exactly, but ['AAT', 1] would become ['A', 'ATCCG'].
Easy: update the split point on the replacements.append line. I've updated my answer.
Oh whoops, I misunderstood your code; disregard my edits.
Here is my solution:

Replace TC with "T C" and GC with "G C" (the position of the space is based on the given index), then split on whitespace.

def digest(seqs, enzymes):
    res = []
    for li in seqs:
        for en in enzymes: 
            li = li.replace(en[0],en[0][:en[1]]+" " + en[0][en[1]:])
        r = li.split()
        res.append(r)
    return res
seqs = ["AATTCCGGTCGGGGCTCGGGGG","AAAGCAAAATCAAAAAAGCAAAAAATC"]
enzymes = [["TC", 1],["GC",1]]
#enzymes = [["AAT", 2],["GC",1]]
print(seqs)
print(digest(seqs, enzymes))

the results are:

for ([["TC", 1],["GC",1]])

['AATTCCGGTCGGGGCTCGGGGG', 'AAAGCAAAATCAAAAAAGCAAAAAATC']
[['AATT', 'CCGGT', 'CGGGG', 'CT', 'CGGGGG'], ['AAAG', 'CAAAAT', 'CAAAAAAG', 'CAAAAAAT', 'C']]

for ([["AAT", 2],["GC",1]])

['AATTCCGGTCGGGGCTCGGGGG', 'AAAGCAAAATCAAAAAAGCAAAAAATC']
[['AA', 'TTCCGGTCGGGG', 'CTCGGGGG'], ['AAAG', 'CAAAA', 'TCAAAAAAG', 'CAAAAAA', 'TC']]
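
A small difference from the '|' marker version is worth knowing when the index is 0, which the comments say is allowed. The marker then lands in front of the site, and the two split calls treat the resulting empty piece differently. A short sketch with a made-up sequence and site:

```python
# Offset 0 puts the marker in front of the recognition site.
marked_pipe = "AATT".replace("AA", "|" + "AA")    # "|AATT"
marked_space = "AATT".replace("AA", " " + "AA")   # " AATT"

pipe_fragments = marked_pipe.split("|")   # keeps the empty leading piece
space_fragments = marked_space.split()    # whitespace split drops empty pieces

print(pipe_fragments)   # ['', 'AATT']
print(space_fragments)  # ['AATT']
```

Which behaviour is wanted depends on whether an empty fragment is meaningful for the digest; the question does not specify this.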

Here is something that should work using regex. In this solution, I find all occurrences of your enzyme strings and split using their corresponding index.

import re


def digestfragmentwithenzyme(seqs, enzymes):
    out = []
    dic = dict(enzymes) # dictionary of enzyme indices

    for seq in seqs:
        sub = []
        pos1 = 0

        enzstr = '|'.join(enz[0] for enz in enzymes) # "TC|GC" in this case
        for match in re.finditer('('+enzstr+')', seq):
            index = dic[match.group(0)]
            pos2 = match.start()+index
            sub.append(seq[pos1:pos2])
            pos1 = pos2
        sub.append(seq[pos1:])
        out.append(sub)
        # [['AATT', 'CCGGT', 'CGGGG', 'CT', 'CGGGGG'], ['AAAG', 'CAAAAT', 'CAAAAAAG', 'CAAAAAAT', 'C']]
    return out

2 Comments

I like yours, but is there any way to make it work with 1 enzyme instead of always needing 2 or more? Maybe with: if enzymes > 1:
@NathanWeesie As far as I know, it already works with 1 enzyme... Why are you saying that the code needs 2 or more?
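
On the single-enzyme question: '|'.join over a one-item sequence returns that item with no separator, so the pattern is still valid. A quick check, using a shortened sequence chosen only for illustration:

```python
import re

enzymes = [["TC", 1]]                          # a single enzyme
dic = dict(enzymes)                            # {"TC": 1}
enzstr = '|'.join(enz[0] for enz in enzymes)   # joining one item adds no '|'
print(enzstr)                                  # TC

seq = "AATTCCGGTC"
cuts = [m.start() + dic[m.group(0)]
        for m in re.finditer('(' + enzstr + ')', seq)]
print(cuts)                                    # [4, 9]
```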
Use positive lookbehind and lookahead regex search:

import re


def digest_fragment_with_enzyme(sequences, enzymes):
    pattern = '|'.join('((?<={})(?={}))'.format(strs[:ind], strs[ind:]) for strs, ind in enzymes)
    print(pattern)  # prints ((?<=T)(?=C))|((?<=G)(?=C))
    for seq in sequences:
        indices = [0] + [m.start() for m in re.finditer(pattern, seq)] + [len(seq)]
        yield [seq[start: end] for start, end in zip(indices, indices[1:])]

seq = ["AATTCCGGTCGGGGCTCGGGGG", "AAAGCAAAATCAAAAAAGCAAAAAATC"]
enzymes = [["TC", 1], ["GC", 1]]
print(list(digest_fragment_with_enzyme(seq, enzymes)))

Output:

[['AATT', 'CCGGT', 'CGGGG', 'CT', 'CGGGGG'],
 ['AAAG', 'CAAAAT', 'CAAAAAAG', 'CAAAAAAT', 'C']]
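
To make the slicing step concrete, here is a sketch of how zip(indices, indices[1:]) turns cut positions into fragments. The positions are hard-coded; they are the ones a TC/GC digest with index 1 gives for the first example sequence.

```python
seq = "AATTCCGGTCGGGGCTCGGGGG"

# Cut positions padded with the start and end of the sequence.
indices = [0, 4, 9, 14, 16, len(seq)]

# Pair each position with the next one: (0, 4), (4, 9), ...
pairs = list(zip(indices, indices[1:]))
fragments = [seq[start:end] for start, end in pairs]

print(pairs)      # [(0, 4), (4, 9), (9, 14), (14, 16), (16, 22)]
print(fragments)  # ['AATT', 'CCGGT', 'CGGGG', 'CT', 'CGGGGG']
```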

The simplest answer I can think of:

input_list = ["AATTCCGGTCGGGGCTCGGGGG","AAAGCAAAATCAAAAAAGCAAAAAATC"]
enzymes = ['TC', 'GC']
output = []
for string in input_list:
    parts = []
    left = 0
    for right in range(1,len(string)):
        if string[right-1:right+1] in enzymes:
            parts.append(string[left:right])
            left = right
    parts.append(string[left:])
    output.append(parts)
print(output)
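
Note that the loop above assumes two-letter sites that are always cut in the middle. A sketch of the same regex-free idea that honours the site length and the index from the question's [site, index] pairs follows; digest_plain is a hypothetical name, and dropping cuts at the very start or end of a sequence is an assumption.

```python
def digest_plain(seqs, enzymes):
    output = []
    for seq in seqs:
        cuts = set()
        for pos in range(len(seq)):
            for site, offset in enzymes:
                # startswith with a start index checks for the site at pos.
                if seq.startswith(site, pos):
                    cuts.add(pos + offset)
        # Ignore cuts at the ends, which would only give empty fragments.
        bounds = [0] + sorted(c for c in cuts if 0 < c < len(seq)) + [len(seq)]
        output.append([seq[a:b] for a, b in zip(bounds, bounds[1:])])
    return output


print(digest_plain(["AATACCG"], [["AAT", 2]]))  # [['AA', 'TACCG']]
```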

Throwing my hat in the ring here.

  • Using dict for patterns rather than list of lists.
  • Joining pattern as others have done to avoid fancy regexes.

import re

sequences = ["AATTCCGGTCGGGGCTCGGGGG","AAAGCAAAATCAAAAAAGCAAAAAATC"]
patterns = { 'TC': 1, 'GC': 1 }

def intervals(patterns, text):
  pattern = '|'.join(patterns.keys())
  start = 0
  for match in re.finditer(pattern, text):
    index = match.start() + patterns.get(match.group())
    yield text[start:index]
    start = index
  yield text[start:]  # use start, not index: index is undefined if nothing matched

print([list(intervals(patterns, s)) for s in sequences])

# [['AATT', 'CCGGT', 'CGGGG', 'CT', 'CGGGGG'], ['AAAG', 'CAAAAT', 'CAAAAAAG', 'CAAAAAAT', 'C']]
