Python: Sub sequence search

Question

I have a sequence 'abccabac' and a subsequence 'abc'. I need to get indices of all occurences of subsequence 'abc' in the sequence. What is the memory efficient way of doing it?

Example:

Input:

**sequence**: 'abccabac' **subsequence**: 'abc'

Output:

asking for a memory efficient way might suggests you have a way that is not memory efficient. Maybe you would be willing to share it with us to give us something to work with? — Tadhg McDonald-Jensen
– Tadhg McDonald-Jensen, Commented May 25, 2016 at 16:45
This is the 'needle-in-a-haystack' problem. The Knuth-Morris-Pratt algorithm is generally considered the most efficient algorithm in terms of time and space complexity. Implement it and see. — Akshat Mahajan
– Akshat Mahajan, Commented May 25, 2016 at 16:50
@AkshatMahajan that looks like an algorithm to find a consecutive substring without repeated rechecking, This is not asking for consecutive characters. — Tadhg McDonald-Jensen
– Tadhg McDonald-Jensen, Commented May 25, 2016 at 16:53
@TadhgMcDonald-Jensen ? OP says he's looking for "indices of all occurrences of subsequence 'abc' in the sequence". KMP is search algorithm for finding occurrence of substring in text. It can be fairly easily generalised to not halt at the first occurrence. Not sure I understand what the issue is. — Akshat Mahajan
– Akshat Mahajan, Commented May 25, 2016 at 16:57
yes but finding a word in a text means that all the letters of the word are obviously consecutive (come right after each other) but sub sequence of abc only occurs once in the sequence abccabac right at the beginning. the indices 0 5 7 are indices of an a a b and a c but not right next to each other. — Tadhg McDonald-Jensen
– Tadhg McDonald-Jensen, Commented May 25, 2016 at 17:08

Tadhg McDonald-Jensen · Accepted Answer · 2016-05-25 19:23:22Z

2

the absolute most strait forward way I can think of is using itertools.combinations

import itertools

sequence = "abccabac"

subsequence = "abc"

for combo in itertools.combinations(enumerate(sequence), len(subsequence)):
    if "".join(pair[1] for pair in combo) == subsequence:
        print([pair[0] for pair in combo])

If your actual sequence contains lots of characters that are not even in the subsequence then filtering out the irrelevant characters before starting the combinations would definitely provide a substantial efficiency boost:

char_generator = (pair for pair in enumerate(sequence) if pair[1] in subsequence)
for combo in itertools.combinations(char_generator, len(subsequence)):
    ...

as well instead of joining each combination you can just use all which will only check characters until one is found to be not equal:

char_generator = (pair for pair in enumerate(sequence) if pair[1] in subsequence)
for combo in itertools.combinations(char_generator, len(subsequence)):
    if all(pair[1]==char for pair,char in zip(combo,subsequence)):
        print([pair[0] for pair in combo])

Here is a different alternative I just put together, I imagine the algorithm could be optimized even more but this is about the best I could come up with. The sub_sequence_generator creates a list of indices for each letter of the sub sequence, then the sub_helper for each level of the recursion goes through all indices for the next letter of the subsequence starting after the indice of the last letter.

import itertools
import bisect

def sub_helper(cur_i, combo, all_options):
    #cur_i is the index of the letter in the subsequence
    #cur_idx is the index of the sequence which contains that letter
    if cur_i>0:
        prev_i = combo[cur_i-1]
    else:
        prev_i = -1
    cur_options = all_options[cur_i]
    start_i = bisect.bisect(cur_options,prev_i)
    cur_i_gen = itertools.islice(cur_options,start_i,None)
    if cur_i+1 == len(all_options): #last one:
        for cur_idx in cur_i_gen:
            combo[cur_i] = cur_idx
            yield tuple(combo)
    else:
        for cur_idx in cur_i_gen:
            combo[cur_i] = cur_idx
            yield from sub_helper(cur_i+1, combo, all_options)


def sub_sequence_generator(sequence,sub_seq):
    indices_map = {c:[] for c in set(sub_seq)} #create a list for each character
    for i,c in enumerate(sequence):
        try:
            indices_map[c].append(i,) #the helper is simpler if they are all stored as single element tuples
        except KeyError:
            pass

    # now we have indices for each character of the sub_sequence
    # we just need to put them in a sequence that corelates to the subsequence
    chr_indices = tuple(indices_map[c] for c in sub_seq)

    return sub_helper(0,[None]*len(chr_indices), chr_indices)

sequence = "abccabac"
subsequence = "abc"

for idxs in sub_sequence_generator(sequence,subsequence):
    print(idxs)

In terms of memory this will make a list for each character of the subsequence, containing the indices in the main sequence where that character is present. A tuple of these lists, and a list of the indices that is continually updated and a tuple for each time a combination is yielded, and iterators like islice so this is extremely memory efficient, however since I have not extensively tested it I cannot give any guarantee it is bug free.

edited May 25, 2016 at 19:23

answered May 25, 2016 at 16:50

Tadhg McDonald-Jensen

21.6k5 gold badges40 silver badges69 bronze badges

Sign up to request clarification or add additional context in comments.

7 Comments

Akshat Mahajan Over a year ago

Not sure if you'd call this efficient. Invoking combinations creates 56 elements with this sequence (8 letters) and subsequence (3 letters), which are fairly small inputs, making runtime approximately O(n^2) - I shudder to think what the length would look like for longer text. :P

Tadhg McDonald-Jensen Over a year ago

maybe not efficient in runtime but it is efficient in memory which is what the question was specifically asking for, I am looking into another way of doing this with sets but the logic is not as simple as this one.

Tadhg McDonald-Jensen Over a year ago

I'm not sure how it relates to big O notation but combinations always generates n! / r! / (n-r)! where n is the length of the sequence and r is the length of subsequence, I completely agree that this is not at all the best way to do it.

verdict_11 Over a year ago

@TadhgMcDonald-Jensen Thanks ! This is exactly what I am trying to do, my subsequence may not go more than 6-7 chars long, but my sequence may be pretty long, using itertools may make the search space explode.

Tadhg McDonald-Jensen Over a year ago

@verdict_11 I added a recursive solution, other then heavily relying on concatenating tuples it is fairly memory and execution efficient.

|

Collectives™ on Stack Overflow

Python: Sub sequence search

Example:

Input:

1 Answer 1

7 Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

Example:

Input:

1 Answer 1

7 Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related