string pattern matching using python

Question

I am working with Python and biological sequences.
I have two sequences.

seq1 = \
...        """ atgaaatttatcattgaacgtgagcatctgctaaaaccactgcaacaggtcagtagcccg
...        ctgggtggacgccctacgttgcctattttgggtaacttgttgctgcaagtcacggaaggc
...        tctttgcggctgaccggtaccgacttggagatggagatggtggcttgtgttgccttgtct
...        cagtcccatgagccgggtgctaccacagtacccgcacggaagttttttgatatctggcgt
...        ggtttacccgaaggggcggaaattacggtagcgttggatggtgatcgcctgctagtgcgc
...        tctggtcgcagccgtttctcgctgtctaccttgcctgcgattgacttccctaatctggat
...        gactggcagagtgaggttgaattcactttaccgcaggctacgttaaagcgtctgattgag
...        tccactcagttttcgatggcccatcaggatgtccgttattatttgaacggcatgctgttt
...        gagaccgaaggcgaagagttacgtactgtggcgaccgatgggcatcgcttggctgtatgc
...        tcaatgcctattggccagacgttaccctcacattcggtgatcgtgccgcgtaaaggtgtg
...        atggagctggttcggttgctggatggtggtgatacccccttgcggctgcaaattggcagt
...        aataatattcgtgctcatgtgggcgattttattttcacatctaagctggttgatggccgt
...        ttcccggattatcgccgcgtattgccgaagaatcctgataaaatgctggaagccggttgc
...        gatttactgaaacaggcattttcgcgtgcggcaattctgtcaaatgagaagttccgtggt
...        gttcggctctatgtcagccacaatcaactcaaaatcactgctaataatcctgaacaggaa
...        gaagcagaagagatcctcgatgttagctacgaggggacagaaatggagatcggtttcaac
...        gtcagctatgtgcttgatgtgctaaatgcactgaagtgcgaagatgtgcgcctgttattg
...        actgactctgtatccagtgtgcagattgaagacagcgccagccaagctgcagcctatgtc
...        gtcatgccaatgcgtttgtag"""

seq2 = \
...        """ accgtagcatctgctaaaaccagtacgcccg
...        ctgggtggacgatgcaacttgttgctgcaagtcacggaaggc
...        tctttgcggctgaccggtaccgacttggagatggagatggtggcttgtgttgccttgtct
...        cagtcccatgagccgggtgctaccacagtacccgcacggaagttttttgatatctggcgt
...        ggtttacccgaaggggcggaaattacggtagcgttggatggtgcatgatcgcctgctagtgcgc
...        tctggtcgcagccgtttctcgctgtctaccttgcctgcgattgacttccctaatctggat
...        gactggcagagtgaggttgaattcactttaccgcaggctacgttaaagcgtctgattgag
...        tccactcagttttcgatgctatttatgtccgttattatttgaacggcatgctgttt
...        gagaccgaaggcgaagagttacgtactgtggcgaccgatgggcatcgcttggctgtatgc
...        tcaatgcctattggccaggctaattcggtgatcgtgccgcgtaaaggtgtg
...        atggagctggttcggttgctggatggtggtgatacccccggcccctgcaaattggcagt
...        aataatattcgtgctcatgtgggcgattttattttcacatctaagctggttgatggccgt
...        ttcccggattatcgccgcgtattgccgaagaatcctgataaaatgctggaagccggttgc
...        gtcatgccaatgcgtttgtag"""

I want to find out how many strings in seq1 and seq2 are the same, along with their respective positions. This is not only pattern matching but also getting the positions. Can anyone tell me how I can do this using Python?

"How many strings in seq1 and seq2 are the same" -- Can you be more specific? Is there any constraint on how long a "string" is, or where it starts?
– mgilson, Commented Aug 27, 2012 at 13:46
Don't start with "using Python". Start with "at all", because you need to have an algorithm for this first.
– Deestan, Commented Aug 27, 2012 at 13:47
Also, is there any significance in the linebreaks, or are they there just to make it easier to read the lines (e.g., should they be stripped out when "matching")?
– mgilson, Commented Aug 27, 2012 at 13:48
@mgilson: No. From the given sequences, I have to search and see how many strings match the other sequence, and then I have to write out the matching strings and their positions.
– sam, Commented Aug 27, 2012 at 13:49

schacki · Accepted Answer · 2012-08-27 18:47:19Z

The indexer function will return all positions as a list:

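def indexer(s, sub):
    # Collect the start index of every occurrence of sub in s (overlaps included).
    positions=[]
    pos=-1  # start at -1 so the first search begins at index 0
    while True:
        pos=s.find(sub,pos+1)
        if pos==-1:
            return positions
        else:
            positions.append(pos)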

The matcher function will return a dict. Each key in the dict is a sequence that occurs in both a and b; the corresponding value is a 2-item tuple that contains all matching positions in a and all matching positions in b:

def matcher(a,b):
    # Collect every non-empty substring of a and of b.
    sequences=set()
    for l in range(1,len(a)+1):
        for pos in range(len(a)-l+1):
            sequences.add(a[pos:pos+l])
    for l in range(1,len(b)+1):
        for pos in range(len(b)-l+1):
            sequences.add(b[pos:pos+l])
    matches={}
    for seq in sequences:
        matches_a=indexer(a,seq)
        matches_b=indexer(b,seq)
        # Keep only the substrings that occur in both a and b.
        if matches_a and matches_b:
            matches[seq]=(matches_a,matches_b)
    return matches

This example should work:

print(matcher('asdfasdfa','asdfasasdfasdfasdfadfasdfdf'))
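
Enumerating every substring of both inputs grows very quickly with sequence length, so for sequences as long as the ones in the question it may help to fix a substring length first. The following is only a sketch of that idea, not part of the answer above: the names clean, kmer_positions and shared_kmers and the length parameter k are made up for illustration, and it assumes the spaces and line breaks in seq1 and seq2 are not meaningful.

```python
from collections import defaultdict

def clean(seq):
    """Remove spaces and line breaks (assumed to be formatting only)."""
    return "".join(seq.split())

def kmer_positions(seq, k):
    """Map every length-k substring of seq to the list of its start positions."""
    index = defaultdict(list)
    for pos in range(len(seq) - k + 1):
        index[seq[pos:pos + k]].append(pos)
    return index

def shared_kmers(a, b, k):
    """Return {substring: (positions_in_a, positions_in_b)} for length-k
    substrings that occur in both a and b."""
    index_a = kmer_positions(a, k)
    index_b = kmer_positions(b, k)
    return {kmer: (index_a[kmer], index_b[kmer])
            for kmer in index_a if kmer in index_b}

print(shared_kmers("asdfasdfa", "asdfdf", 4))
```

For the sequences in the question this would be called as shared_kmers(clean(seq1), clean(seq2), k) with whatever length k is of interest.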

Eero Aaltonen · Accepted Answer · 2012-08-27 13:57:16Z


Perhaps Wikibooks can help you get started?

edited Aug 27, 2012 at 13:57

answered Aug 27, 2012 at 13:44


3 Comments

sam Over a year ago

But the string will not be fixed. I have to match all the string cases from both of the available sequences, and then I want to find the strings and their occurrences.

schacki Over a year ago

Can you explain in more detail what you mean by "all string cases from both the available sequences"? Of any length?

Matthias Over a year ago

I think he's looking for the longest common subsequence.
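
If that reading is right, the standard library's difflib module may be a reasonable starting point. Note that it reports contiguous matching blocks (common substrings) rather than a subsequence in the strict sense, and that get_matching_blocks returns one ordered alignment of blocks, not every shared substring. The sketch below is an illustration only: the helper names clean, longest_common_block and common_blocks are invented here, and stripping whitespace assumes the line breaks in the sequences carry no meaning.

```python
from difflib import SequenceMatcher

def clean(seq):
    """Drop spaces and line breaks so positions refer to bases only."""
    return "".join(seq.split())

def longest_common_block(a, b):
    """Return (pos_in_a, pos_in_b, text) for the longest contiguous match."""
    sm = SequenceMatcher(None, a, b, autojunk=False)
    m = sm.find_longest_match(0, len(a), 0, len(b))
    return m.a, m.b, a[m.a:m.a + m.size]

def common_blocks(a, b, min_size=1):
    """Return (pos_in_a, pos_in_b, text) for each matching block of at
    least min_size characters, in order of position."""
    sm = SequenceMatcher(None, a, b, autojunk=False)
    return [(m.a, m.b, a[m.a:m.a + m.size])
            for m in sm.get_matching_blocks() if m.size >= min_size]

print(longest_common_block("aaacgtgtt", "ccccgtgcc"))
print(common_blocks("aaacgtgtt", "ccccgtgcc", min_size=3))
```

autojunk=False switches off difflib's heuristic that ignores very frequent elements in long inputs, which would otherwise affect sequences built from only four letters.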

Pierre GM · Accepted Answer · 2012-08-27 13:58:49Z


You could just use index:

>>> seq.index(str)
1046

Note that it finds only the position of the first occurrence. You could then look for further occurrences in slices of the string.

EDITED

When there are several occurrences, a loop like this could work:

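str_ = "atg"  # placeholder value: the substring to search for
positions = []
last_position = 0  # offset of test within seq1
test = seq1 + ""
while test:
    try:
        position = test.index(str_)
    except ValueError:
        break  # no further occurrence
    positions.append(position + last_position)
    position += len(str_)
    last_position += position
    test = test[position:]
print(positions)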

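We work on a copy of seq1 (named test) because the loop consumes it. On each pass we look up the next position with the index method, store it in positions, and slice off the part of the string that has already been searched. Because the search resumes after the end of each match, overlapping occurrences are not reported.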
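
If overlapping occurrences matter, one option is a zero-width lookahead with re.finditer. This is a sketch of an alternative technique rather than the loop above, and the function name find_all is made up for illustration.

```python
import re

def find_all(seq, sub):
    """Return the start position of every occurrence of sub in seq,
    including overlapping ones."""
    # The lookahead (?=...) matches without consuming characters, so the
    # scan advances one position at a time and overlapping hits are kept.
    pattern = "(?=" + re.escape(sub) + ")"
    return [m.start() for m in re.finditer(pattern, seq)]

print(find_all("aaaa", "aa"))
print(find_all("atgcatg", "atg"))
```

re.escape is used so that sub is treated as literal text even if it contains characters that are special in regular expressions.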

[PS] Bad, bad idea to call a variable str, you're overwriting a built-in...

edited Aug 27, 2012 at 13:58

answered Aug 27, 2012 at 13:42

