I am working with biological sequences in Python.
I have two sequences:

seq1 = \
...        """ atgaaatttatcattgaacgtgagcatctgctaaaaccactgcaacaggtcagtagcccg
...        ctgggtggacgccctacgttgcctattttgggtaacttgttgctgcaagtcacggaaggc
...        tctttgcggctgaccggtaccgacttggagatggagatggtggcttgtgttgccttgtct
...        cagtcccatgagccgggtgctaccacagtacccgcacggaagttttttgatatctggcgt
...        ggtttacccgaaggggcggaaattacggtagcgttggatggtgatcgcctgctagtgcgc
...        tctggtcgcagccgtttctcgctgtctaccttgcctgcgattgacttccctaatctggat
...        gactggcagagtgaggttgaattcactttaccgcaggctacgttaaagcgtctgattgag
...        tccactcagttttcgatggcccatcaggatgtccgttattatttgaacggcatgctgttt
...        gagaccgaaggcgaagagttacgtactgtggcgaccgatgggcatcgcttggctgtatgc
...        tcaatgcctattggccagacgttaccctcacattcggtgatcgtgccgcgtaaaggtgtg
...        atggagctggttcggttgctggatggtggtgatacccccttgcggctgcaaattggcagt
...        aataatattcgtgctcatgtgggcgattttattttcacatctaagctggttgatggccgt
...        ttcccggattatcgccgcgtattgccgaagaatcctgataaaatgctggaagccggttgc
...        gatttactgaaacaggcattttcgcgtgcggcaattctgtcaaatgagaagttccgtggt
...        gttcggctctatgtcagccacaatcaactcaaaatcactgctaataatcctgaacaggaa
...        gaagcagaagagatcctcgatgttagctacgaggggacagaaatggagatcggtttcaac
...        gtcagctatgtgcttgatgtgctaaatgcactgaagtgcgaagatgtgcgcctgttattg
...        actgactctgtatccagtgtgcagattgaagacagcgccagccaagctgcagcctatgtc
...        gtcatgccaatgcgtttgtag"""

seq2 = \
...        """ accgtagcatctgctaaaaccagtacgcccg
...        ctgggtggacgatgcaacttgttgctgcaagtcacggaaggc
...        tctttgcggctgaccggtaccgacttggagatggagatggtggcttgtgttgccttgtct
...        cagtcccatgagccgggtgctaccacagtacccgcacggaagttttttgatatctggcgt
...        ggtttacccgaaggggcggaaattacggtagcgttggatggtgcatgatcgcctgctagtgcgc
...        tctggtcgcagccgtttctcgctgtctaccttgcctgcgattgacttccctaatctggat
...        gactggcagagtgaggttgaattcactttaccgcaggctacgttaaagcgtctgattgag
...        tccactcagttttcgatgctatttatgtccgttattatttgaacggcatgctgttt
...        gagaccgaaggcgaagagttacgtactgtggcgaccgatgggcatcgcttggctgtatgc
...        tcaatgcctattggccaggctaattcggtgatcgtgccgcgtaaaggtgtg
...        atggagctggttcggttgctggatggtggtgatacccccggcccctgcaaattggcagt
...        aataatattcgtgctcatgtgggcgattttattttcacatctaagctggttgatggccgt
...        ttcccggattatcgccgcgtattgccgaagaatcctgataaaatgctggaagccggttgc
...        gtcatgccaatgcgtttgtag"""

I want to find which substrings seq1 and seq2 have in common, and the position of each in both sequences. This is not just pattern matching; I also need the positions. Can anyone tell me how to do this in Python?

  • "How many strings in seq1 and seq2 are the same" -- Can you be more specific? Is there any constraint on how long a "string" is, or where it starts? Commented Aug 27, 2012 at 13:46
  • Don't start with "using Python". Start with "at all", because you need to have an algorithm for this first. Commented Aug 27, 2012 at 13:47
  • Also, is there any significance in the linebreaks, or are they there just to make it easier to read the lines (e.g., should they be stripped out when "matching")? Commented Aug 27, 2012 at 13:48
  • @mgilson: No. From the given sequences, I have to find which strings in one sequence also occur in the other, and then write out the matching strings and their positions. Commented Aug 27, 2012 at 13:49
  • The linebreaks are just there to make it easier to read. Commented Aug 27, 2012 at 13:49
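
Since the linebreaks and indentation are only there for readability, they would need to be stripped before any matching, otherwise positions are counted against the whitespace too. A minimal sketch, assuming a hypothetical helper named clean_sequence (not part of the original post):

```python
def clean_sequence(raw):
    """Remove all whitespace (spaces, newlines, indentation) and lowercase the result."""
    # str.split() with no arguments splits on any run of whitespace,
    # so joining the pieces drops every space and linebreak.
    return "".join(raw.split()).lower()


# Short made-up fragment laid out like the sequences in the question.
raw = """ atgaaa
        ctgggt
        tag"""
print(clean_sequence(raw))
```

Lowercasing is an assumption here; drop it if case carries meaning in your data.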

3 Answers

1

The indexer function returns all positions of sub in s as a list:

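def indexer(s, sub):
    """Return the start index of every (possibly overlapping) occurrence of sub in s."""
    positions = []
    pos = -1  # start at -1 so the first search begins at index 0
    while True:
        pos = s.find(sub, pos + 1)
        if pos == -1:
            return positions
        positions.append(pos)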

The matcher function returns a dict. Each key is a substring that occurs in both a and b; the corresponding value is a 2-item tuple containing all matching positions in a and all matching positions in b:

def matcher(a, b):
    # Collect every substring of a; only these can be common to both strings.
    sequences = set()
    for l in range(1, len(a) + 1):
        for pos in range(len(a) - l + 1):
            sequences.add(a[pos:pos + l])
    matches = {}
    for seq in sequences:
        matches_b = indexer(b, seq)
        if matches_b:  # seq also occurs in b
            matches[seq] = (indexer(a, seq), matches_b)
    return matches

This example should work:

print(matcher('asdfasdfa', 'asdfasasdfasdfasdfadfasdfdf'))
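
Enumerating every substring grows very quickly with sequence length, so it may be too slow for sequences of a thousand or more bases. As a possible alternative (a different technique, not the approach above), the standard library's difflib.SequenceMatcher reports aligned, non-overlapping matching blocks together with their start positions in both strings. A minimal sketch, where common_blocks and min_length are hypothetical names introduced for illustration:

```python
from difflib import SequenceMatcher


def common_blocks(a, b, min_length=1):
    """Return (substring, start_in_a, start_in_b) for each matching block."""
    # autojunk=False turns off the heuristic that ignores very frequent
    # characters, which would otherwise affect long 4-letter sequences.
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return [
        (a[m.a:m.a + m.size], m.a, m.b)
        for m in matcher.get_matching_blocks()
        if m.size >= min_length  # also drops the final zero-length sentinel
    ]


for block, pos_a, pos_b in common_blocks("abxcd", "abcd"):
    print(block, pos_a, pos_b)
```

Note that this yields one alignment of the two strings rather than every shared substring, so repeated occurrences elsewhere in either sequence are not listed.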

0

Perhaps Wikibooks can help you get started?

3 Comments

But the string is not fixed. I have to match all substrings shared by the two available sequences, and then I want to report those strings and their occurrences.
Can you explain in more detail what you mean by "all string cases from both the available sequences"? Of any length?
I think he's looking for the longest common subsequence.
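
If the goal is the longest contiguous shared run (the longest common substring, which differs from the longest common subsequence, where gaps are allowed), a dynamic-programming sketch could look like the following. The function name is hypothetical, and the running time is proportional to len(a) * len(b):

```python
def longest_common_substring(a, b):
    """Return (substring, start_in_a, start_in_b) for the longest shared run."""
    best_len, best_end_a, best_end_b = 0, 0, 0
    # prev[j] = length of the common run ending at a[i-2] and b[j-1]
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the run that ended one character earlier in both strings.
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end_a, best_end_b = curr[j], i, j
        prev = curr
    start_a = best_end_a - best_len
    start_b = best_end_b - best_len
    return a[start_a:best_end_a], start_a, start_b


print(longest_common_substring("xxabcdyy", "zzabcdq"))
```

When several runs tie for the longest, this sketch returns the one that ends earliest in a.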
0

You could just use index:

>>> seq.index(str)
1046

Note that it finds the position of the first occurrence only. You could then look for further occurrences in slices of the string.

EDITED

When there are several occurrences, a loop like this could work:

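str_ = "atg"        # substring to look for (example value, an assumption)
positions = []
last_position = 0   # how much of seq1 has already been consumed
test = seq1
while True:
    try:
        position = test.index(str_)
    except ValueError:
        break       # no further occurrence
    positions.append(position + last_position)
    position += len(str_)
    last_position += position
    test = test[position:]
print(positions)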

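We work on a separate variable, test, because we'll consume it; seq1 itself is left untouched. We keep looking up a position with the index method, storing it in positions (adjusted by the amount already consumed) and trimming the string past the match. Because each match is cut off before searching again, overlapping occurrences are not reported.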
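
The same search can also be written without slicing, by passing a start offset to str.find. This is only a sketch; find_all and its overlapping flag are hypothetical names, not part of the answer above:

```python
def find_all(text, sub, overlapping=False):
    """Return the start position of every occurrence of sub in text."""
    if not sub:
        return []  # an empty pattern would match everywhere; treat it as no match
    positions = []
    start = text.find(sub)
    while start != -1:
        positions.append(start)
        # Step past the whole match, or by one character to allow overlaps.
        step = 1 if overlapping else len(sub)
        start = text.find(sub, start + step)
    return positions


print(find_all("aaaa", "aa"))
print(find_all("aaaa", "aa", overlapping=True))
```

Searching from an offset avoids copying the remainder of the string on every iteration, which may matter for long sequences.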

[PS] Bad, bad idea to call a variable str: you're shadowing a built-in...
