python string splitting with multiple splitting points

Question

OK, I'll get straight to the point. Here is my code:

def digestfragmentwithenzyme(seqs, enzymes):
    fragment = []
    for seq in seqs:
        for enzyme in enzymes:
            results = []
            prog = re.compile(enzyme[0])
            for dingen in prog.finditer(seq):
                results.append(dingen.start() + enzyme[1])
            results.reverse()
            #result = 0
            for result in results:
                fragment.append(seq[result:])
                seq = seq[:result]
            fragment.append(seq[:result])
    fragment.reverse()
    return fragment

The input to this function is a list of strings (seqs), e.g.:

List = ["AATTCCGGTCGGGGCTCGGGGG","AAAGCAAAATCAAAAAAGCAAAAAATC"]

And a list of enzymes as the second input:

[["TC", 1],["GC",1]]

(Note: more enzymes can be given, but most of them look like these: short patterns made of the letters A, T, C and G.)

The function should return a list that, in this example, contains 2 lists:

Outputlist = [["AATT","CCGGT","CGGGG","CT","CGGGGG"],["AAAG","CAAAAT","CAAAAAAG","CAAAAAAT","C"]]

Right now I am having trouble splitting it twice and getting the right output.

A little more information about the function: it looks through the string (seq) for the recognition site, in this case TC or GC, and splits the string at the offset given as the second element of each enzyme entry. It should do that for both strings in the list, with both enzymes.

It might help to elaborate what exactly the "right output" is. If your program does not do what you want then it won't help us readers to understand what exactly the relation between the input sequence, the enzyme list and the output list is. It's obvious that it is more than a simple search for substrings.
– Risadinha, Commented Mar 31, 2017 at 18:37
Well for starters prog is a regex and should operate on a string, while seq is a list of strings, so prog.finditer(seq) is an error. You need to work with one input string at a time.
– Alex Hall, Commented Mar 31, 2017 at 18:37
@AlexHall Yes, I tried it with for seq in seqs (and changed it in the parameters as well), but it didn't give me the correct output.
– Nathan Weesie, Commented Mar 31, 2017 at 18:43
@Risadinha The right output is also given: it's the Outputlist. The function should return this if it is programmed correctly.
– Nathan Weesie, Commented Mar 31, 2017 at 18:43
Well that code was a step closer because it didn't raise an exception and end on the 4th line, so show us that code.
– Alex Hall, Commented Mar 31, 2017 at 18:44

pbuck · Accepted Answer · 2017-03-31 19:14:55Z

1

Assuming the idea is to split at every occurrence of each enzyme's recognition site, at the given index within that site (for a two-letter site with index 1, the cut falls between the two letters), you don't need regex.

You can do this by looking for the occurrences and inserting a split indicator at the correct index and then post-process the result to actually split.

For example:

def digestfragmentwithenzyme(seqs, enzymes):
    # preprocess enzymes once, then apply to each sequence
    replacements = []
    for enzyme in enzymes:
        replacements.append((enzyme[0], enzyme[0][0:enzyme[1]] + '|' + enzyme[0][enzyme[1]:]))
    result = []
    for seq in seqs:
        for r in replacements:
            seq = seq.replace(r[0], r[1])   # So AATTC becomes AATT|C
        result.append(seq.split('|'))       # So AATT|C becomes AATT, C
    return result

def test():
    seqs = ["AATTCCGGTCGGGGCTCGGGGG","AAAGCAAAATCAAAAAAGCAAAAAATC"]
    enzymes = [["TC", 1],["GC",1]]
    print(digestfragmentwithenzyme(seqs, enzymes))

edited Mar 31, 2017 at 19:14

answered Mar 31, 2017 at 19:02

pbuck


5 Comments

Nathan Weesie Over a year ago

No, the enzymes can be longer than 2 letters and the index can be greater or less than 2. It can be anywhere from 0 to 5, and the letters have no min or max length.

pbuck Over a year ago

So, for enzyme ['AAT', 2], then 'AATACCG' becomes 'AA', 'TACCG' , but for ['AAT', 1] it would be 'A', 'AATCCG' ?

Nathan Weesie Over a year ago

yes exactly but ['AAT',1] would become [ 'A', 'ATCCG']

pbuck Over a year ago

Easy: update the split point on the replacements.append line. I've updated my answer.

Charlie G Over a year ago

oh whoops, misunderstood your code, disregard my edits
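
The split rule settled in these comments can be checked in isolation. A minimal sketch, assuming a hypothetical helper `cut_site` (not part of the answer) that applies the same replace-then-split idea to one sequence and one enzyme:

```python
def cut_site(seq, site, index, marker="|"):
    """Split seq `index` characters into every occurrence of `site`.

    Hypothetical helper for illustration only; it assumes `marker`
    never occurs in the sequence itself.
    """
    marked = seq.replace(site, site[:index] + marker + site[index:])
    return marked.split(marker)


print(cut_site("AATACCG", "AAT", 2))  # ['AA', 'TACCG']
print(cut_site("AATCCG", "AAT", 1))   # ['A', 'ATCCG']
```

Note that `str.replace` only finds non-overlapping occurrences, so this sketch inherits that behaviour.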

PKey · Accepted Answer · 2017-03-31 20:57:48Z

Here is my solution:

Replace TC with T C and GC with G C (the space is inserted at the given index), then split on whitespace.

def digest(seqs, enzymes):
    res = []
    for li in seqs:
        for en in enzymes: 
            li = li.replace(en[0],en[0][:en[1]]+" " + en[0][en[1]:])
        r = li.split()
        res.append(r)
    return res
seqs = ["AATTCCGGTCGGGGCTCGGGGG","AAAGCAAAATCAAAAAAGCAAAAAATC"]
enzymes = [["TC", 1],["GC",1]]
#enzymes = [["AAT", 2],["GC",1]]
print(seqs)
print(digest(seqs, enzymes))

The results are:

for [["TC", 1],["GC",1]]:

['AATTCCGGTCGGGGCTCGGGGG', 'AAAGCAAAATCAAAAAAGCAAAAAATC']
[['AATT', 'CCGGT', 'CGGGG', 'CT', 'CGGGGG'], ['AAAG', 'CAAAAT', 'CAAAAAAG', 'CAAAAAAT', 'C']]

for [["AAT", 2],["GC",1]]:

['AATTCCGGTCGGGGCTCGGGGG', 'AAAGCAAAATCAAAAAAGCAAAAAATC']
[['AA', 'TTCCGGTCGGGG', 'CTCGGGGG'], ['AAAG', 'CAAAA', 'TCAAAAAAG', 'CAAAAAA', 'TC']]

Julien Spronck · Accepted Answer · 2017-03-31 19:16:53Z

0

Here is something that should work using regex. In this solution, I find all occurrences of your enzyme strings and split using their corresponding index.

import re

def digestfragmentwithenzyme(seqs, enzymes):
    out = []
    dic = dict(enzymes) # dictionary of enzyme indices

    for seq in seqs:
        sub = []
        pos1 = 0

        enzstr = '|'.join(enz[0] for enz in enzymes) # "TC|GC" in this case
        for match in re.finditer('('+enzstr+')', seq):
            index = dic[match.group(0)]
            pos2 = match.start()+index
            sub.append(seq[pos1:pos2])
            pos1 = pos2
        sub.append(seq[pos1:])
        out.append(sub)
        # [['AATT', 'CCGGT', 'CGGGG', 'CT', 'CGGGGG'], ['AAAG', 'CAAAAT', 'CAAAAAAG', 'CAAAAAAT', 'C']]
    return out

edited Mar 31, 2017 at 19:16

answered Mar 31, 2017 at 19:10

Julien Spronck


2 Comments

Nathan Weesie Over a year ago

I like yours, but is there any way to make it work with 1 enzyme instead of always needing 2 or more? Maybe with: if enzymes > 1:

Julien Spronck Over a year ago

@NathanWeesie as far as I know, it already works with 1 enzyme ... why are you saying that the code needs 2 or more?
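
A quick way to check the single-enzyme case is to run the same approach with a one-element enzyme list. The sketch below is a compact restatement of the answer's idea, not the answer's exact code; the name `digest` and the use of `re.escape` are assumptions added for illustration.

```python
import re


def digest(seqs, enzymes):
    # Compact restatement of the regex approach, for illustration only.
    offsets = dict(enzymes)  # recognition site -> cut offset
    pattern = "|".join(re.escape(site) for site in offsets)
    out = []
    for seq in seqs:
        parts, start = [], 0
        for match in re.finditer(pattern, seq):
            cut = match.start() + offsets[match.group(0)]
            parts.append(seq[start:cut])
            start = cut
        parts.append(seq[start:])  # remainder after the last cut
        out.append(parts)
    return out


single = digest(["AATTCCGGTCGGGGCTCGGGGG"], [["TC", 1]])
print(single)  # [['AATT', 'CCGGT', 'CGGGGCT', 'CGGGGG']]
```

With one enzyme the joined pattern is simply `"TC"`, so no special case such as `if enzymes > 1` should be needed.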

Ashwini Chaudhary · Accepted Answer · 2017-03-31 19:18:40Z

0

Use positive lookbehind and lookahead regex search:

import re


def digest_fragment_with_enzyme(sequences, enzymes):
    pattern = '|'.join('((?<={})(?={}))'.format(strs[:ind], strs[ind:]) for strs, ind in enzymes)
    print(pattern)  # prints ((?<=T)(?=C))|((?<=G)(?=C))
    for seq in sequences:
        indices = [0] + [m.start() for m in re.finditer(pattern, seq)] + [len(seq)]
        yield [seq[start: end] for start, end in zip(indices, indices[1:])]

seq = ["AATTCCGGTCGGGGCTCGGGGG", "AAAGCAAAATCAAAAAAGCAAAAAATC"]
enzymes = [["TC", 1], ["GC", 1]]
print(list(digest_fragment_with_enzyme(seq, enzymes)))

Output:

[['AATT', 'CCGGT', 'CGGGG', 'CT', 'CGGGGG'],
 ['AAAG', 'CAAAAT', 'CAAAAAAG', 'CAAAAAAT', 'C']]
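
The same lookaround construction should also cover longer recognition sites and other indices, such as the ['AAT', 2] case raised in the comments on another answer. A minimal sketch, assuming two hypothetical helpers (`cut_pattern`, `split_at_cuts`) and adding `re.escape` as a precaution that the answer itself does not use:

```python
import re


def cut_pattern(enzymes):
    # One zero-width alternative per enzyme: the text before the cut must
    # end with site[:index], the text after it must start with site[index:].
    return "|".join(
        "(?<={})(?={})".format(re.escape(site[:index]), re.escape(site[index:]))
        for site, index in enzymes
    )


def split_at_cuts(seq, pattern):
    # Every zero-width match marks a cut; slice between neighbouring cuts.
    cuts = [0] + [m.start() for m in re.finditer(pattern, seq)] + [len(seq)]
    return [seq[a:b] for a, b in zip(cuts, cuts[1:])]


pattern = cut_pattern([["AAT", 2]])
print(pattern)                            # (?<=AA)(?=T)
print(split_at_cuts("AATACCG", pattern))  # ['AA', 'TACCG']
```

Because the matches are zero-width, nothing is consumed from the sequence and the fragments always concatenate back to the original string.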

edited Mar 31, 2017 at 19:18

answered Mar 31, 2017 at 19:11

Ashwini Chaudhary


BallpointBen · Accepted Answer · 2017-03-31 19:25:01Z

0

The simplest answer I can think of:

input_list = ["AATTCCGGTCGGGGCTCGGGGG","AAAGCAAAATCAAAAAAGCAAAAAATC"]
enzymes = ['TC', 'GC']
output = []
for string in input_list:
    parts = []
    left = 0
    for right in range(1,len(string)):
        if string[right-1:right+1] in enzymes:
            parts.append(string[left:right])
            left = right
    parts.append(string[left:])
    output.append(parts)
print(output)
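
The loop above is tied to two-letter sites cut after the first letter. A possible generalization to the (site, index) pairs from the question is sketched below; the function name `digest_scan` and the choice to ignore cuts at the very start or end of a sequence are assumptions, not something this answer specifies.

```python
def digest_scan(seqs, enzymes):
    # enzymes: (site, index) pairs as in the question; a cut is placed
    # `index` characters into every occurrence of `site`.
    output = []
    for seq in seqs:
        cuts = set()
        for site, index in enzymes:
            pos = seq.find(site)
            while pos != -1:
                cuts.add(pos + index)
                pos = seq.find(site, pos + 1)  # step by 1 to catch overlaps
        # Cuts at 0 or len(seq) would only create empty fragments.
        bounds = [0] + sorted(c for c in cuts if 0 < c < len(seq)) + [len(seq)]
        output.append([seq[a:b] for a, b in zip(bounds, bounds[1:])])
    return output


seqs = ["AATTCCGGTCGGGGCTCGGGGG", "AAAGCAAAATCAAAAAAGCAAAAAATC"]
print(digest_scan(seqs, [["TC", 1], ["GC", 1]]))
```

Collecting cut positions first and slicing once keeps the enzymes independent of each other, so the order in which they are listed should not matter.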

answered Mar 31, 2017 at 19:25

BallpointBen


Kenan Banks · Accepted Answer · 2017-03-31 20:14:05Z

0

Throwing my hat in the ring here.

Using dict for patterns rather than list of lists.
Joining pattern as others have done to avoid fancy regexes.


import re

sequences = ["AATTCCGGTCGGGGCTCGGGGG","AAAGCAAAATCAAAAAAGCAAAAAATC"]
patterns = { 'TC': 1, 'GC': 1 }

def intervals(patterns, text):
  pattern = '|'.join(patterns.keys())
  start = 0
  for match in re.finditer(pattern, text):
    index = match.start() + patterns.get(match.group())
    yield text[start:index]
    start = index
  yield text[start:]  # remainder; also correct when nothing matched

print([list(intervals(patterns, s)) for s in sequences])

# [['AATT', 'CCGGT', 'CGGGG', 'CT', 'CGGGGG'], ['AAAG', 'CAAAAT', 'CAAAAAAG', 'CAAAAAAT', 'C']]

answered Mar 31, 2017 at 20:14

Kenan Banks
