Is there an efficient way to split a sequence like this without using [:] slicing?

GATAAG  G  ATAAG
        GA  TAAG
        GAT  AAG
        GATA  AG
        GATAA  G

I found something in itertools, but it doesn't do it right:

import itertools
import operator

def subslices(seq):
    "Return all contiguous non-empty subslices of a sequence"
    # subslices('ABCD') --> A AB ABC ABCD B BC BCD C CD D
    slices = itertools.starmap(slice, itertools.combinations(range(len(seq) + 1), 2))
    return map(operator.getitem, itertools.repeat(seq), slices)

s = 'GATAAG'
list(subslices(s))
['G', 'GA', 'GAT', 'GATA', 'GATAA', 'GATAAG', 'A', 'AT', 'ATA', 'ATAA', 'ATAAG', 'T', 'TA', 'TAA', 'TAAG', 'A', 'AA', 'AAG', 'A', 'AG', 'G']

It is also not readable. Another solution:

def splitting_kmer(s):
    n = len(s)
    print(n)
    # Move the split point from right to left; i stops at n - 1 so neither part is empty
    for i in range(1, n):
        print(s[:n-i], s[n-i:])

Paulo

  • What's wrong with [:] slicing? Commented Sep 8, 2022 at 22:34
  • Just curious if there is something different to learn. Thanks Commented Sep 8, 2022 at 22:36
  • There's always something different to learn, but doing so is pointless unless there is some use to it. Given how simple and elegant slicing is, that can hardly be it. And slicing is also fairly efficient, so what type of string splitting, or what application of it are you looking for? In what way could it be better - or in what way do you need it to be? (note that both 'solutions' you included still use slicing with slice and :) Commented Sep 8, 2022 at 22:54
  • "I need a efficient way to split the words" - that's easy with slicing, and I seriously doubt the slicing of the word is anywhere near a performance bottleneck for a task like that. That's like optimising the walking route to your car before taking a cross-country roadtrip to save time. Commented Sep 9, 2022 at 1:06
  • seconding - I suspect it's most efficient to slice here unless you're using a scientific Python library like NumPy because it will get a view of the string rather than creating a new string, and further to create a generator (you may find you can even delegate like yield from map instead of return _) if your caller is just going to iterate over the results Commented Sep 9, 2022 at 1:21
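
For what it's worth, the generator idea from the last comment could look something like the sketch below. It still uses plain slicing, and the name iter_splits is made up here for illustration; it is not from the question or from any library.

```python
from typing import Iterator, Tuple


def iter_splits(seq: str) -> Iterator[Tuple[str, str]]:
    """Lazily yield every (prefix, suffix) pair where both parts are non-empty."""
    # The split point i runs from 1 to len(seq) - 1, so neither side is ever empty.
    for i in range(1, len(seq)):
        yield seq[:i], seq[i:]


for left, right in iter_splits("GATAAG"):
    print(left, right)
# G ATAAG
# GA TAAG
# GAT AAG
# GATA AG
# GATAA G
```

Because it is a generator, nothing is sliced until the caller iterates, which is probably all the "efficiency" this task needs.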

1 Answer

A simple and efficient way to get all unique substrings of a string:

sample = 'GATAAG'

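# j runs up to len(sample) inclusive, so substrings ending at the last character are kept
slices = set(sample[i:j] for i in range(len(sample)) for j in range(i+1, len(sample)+1))

print(slices)

Result:

{'AA', 'AT', 'GATA', 'A', 'GATAA', 'G', 'GA', 'TA', 'T', 'ATA', 'TAA', 'ATAA', 'GAT', 'GATAAG', 'ATAAG', 'TAAG', 'AAG', 'AG'}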

They are in arbitrary order because it's a set (which is unordered by definition), and they're in a set to ensure there are no duplicates. If you want duplicates and order:

sample = 'GATAAG'

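# j runs up to len(sample) inclusive, so substrings ending at the last character are kept
slices = [sample[i:j] for i in range(len(sample)) for j in range(i+1, len(sample)+1)]

print(slices)

Result:

['G', 'GA', 'GAT', 'GATA', 'GATAA', 'GATAAG', 'A', 'AT', 'ATA', 'ATAA', 'ATAAG', 'T', 'TA', 'TAA', 'TAAG', 'A', 'AA', 'AAG', 'A', 'AG', 'G']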
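
If you want no duplicates but still a predictable order, one option is to deduplicate with dict.fromkeys, which keeps the first occurrence of each key in insertion order (guaranteed for dicts since Python 3.7). This is only a sketch: the function name unique_substrings_in_order is made up for illustration, and it includes substrings that end at the last character.

```python
def unique_substrings_in_order(text):
    """Return each distinct non-empty substring once, in first-seen order."""
    n = len(text)
    # j goes up to n inclusive so that substrings ending at the last character are included.
    candidates = (text[i:j] for i in range(n) for j in range(i + 1, n + 1))
    # dict keys are unique and keep insertion order, so this drops later repeats only.
    return list(dict.fromkeys(candidates))


print(unique_substrings_in_order("GATAAG"))
# ['G', 'GA', 'GAT', 'GATA', 'GATAA', 'GATAAG', 'A', 'AT', 'ATA', 'ATAA', 'ATAAG', 'T', 'TA', 'TAA', 'TAAG', 'AA', 'AAG', 'AG']
```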