1

I'm unable to count the occurrences of each substring of length n in a string. For example, if the string is

CCCATGGTtaGGTaTGCCCGAGGT

and n is

3

The output should be something like:

'CCC': 2, 'GGT': 3

The input is a list of lists, so I iterate over every string in each list, but I'm not able to go any further. The output should be a dict covering all the strings.

Code:

def get_all_n_repeats(n,sq_list):
    reps={}
    for i in sq_list:
        if not i:
            continue
        else:   
            for j in i:
                ........#Here the code I want to do#......                  
    return reps
  • Why is it GGT and not GTt? Commented May 29, 2016 at 19:52
  • You need to at least show something you have tried. Commented May 29, 2016 at 19:55
  • Your output and your input don't make sense. If you split your input string into three letter strings, you get ['CCC', 'ATG', 'GTt', 'aGG', 'TaT', 'GCC', 'CGA', 'GGT'] so I don't know where you got GGT in your output. Commented May 29, 2016 at 19:57
  • What is so unclear about this question? It makes perfect sense. Commented May 29, 2016 at 20:03
  • @BurhanKhalid I think his candidates are ['CCC', 'CCA', 'CAT', 'ATG', 'TGG', 'GGT', 'GTt', 'Tta', 'taG', 'aGG', 'GGT', 'GTa', 'TaT', 'aTG', 'TGC', 'GCC', 'CCC', 'CCG', 'CGA', 'GAG', 'AGG', 'GGT']. Commented May 29, 2016 at 20:04

3 Answers

2

A really simple solution:

from collections import Counter

st = "CCCATGGTtaGGTaTGCCCGAGGT"
n = 3

tokens = Counter(st[i:i+n] for i in range(len(st) - n + 1))
print(tokens.most_common(2))

After that, it is up to you to turn it into a helper function.
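
As a rough sketch of such a helper, the Counter approach can be wrapped in the function signature from the question. This assumes that sq_list is a list of lists of strings, that counts are summed across all strings, and that only substrings occurring more than once should be returned (as in the expected output in the question); the names group, seq and counts are illustrative only.

```python
from collections import Counter

def get_all_n_repeats(n, sq_list):
    """Count overlapping n-length substrings across a list of lists of strings."""
    counts = Counter()
    for group in sq_list:          # empty inner lists are simply skipped
        for seq in group:
            # Slide a window of width n over the string; strings shorter
            # than n produce an empty range and contribute nothing
            counts.update(seq[i:i + n] for i in range(len(seq) - n + 1))
    # Assumption: keep only the substrings seen more than once
    return {sub: c for sub, c in counts.items() if c > 1}

reps = get_all_n_repeats(3, [["CCCATGGTtaGGTaTGCCCGAGGT"], []])
print(reps)
```

The comparison is case-sensitive, so 'GGT' and 'GTt' are counted separately. If every substring should be reported, not only the repeated ones, returning dict(counts) instead would do.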


1

A very explicit solution:

s = 'CCCATGGTtaGGTaTGCCCGAGGT'
n = 3
# All possible n-length strings
l = [s[i:i + n] for i in range(len(s) - (n - 1))]
# Count their distribution
d = {}
for e in l:
    d[e] = d.get(e, 0) + 1
print(d)


0

Use Counter

from collections import Counter

def count_occurrences(st, n):
    # Collect every overlapping n-length substring of st
    candidates = [st[i:i + n] for i in range(len(st) - n + 1)]

    # Keep only the substrings that occur more than once
    output = {}
    for k, v in Counter(candidates).items():
        if v > 1:
            output[k] = v
    return output

st = "CCCATGGTtaGGTaTGCCCGAGGT"
n = 3

print(count_occurrences(st, n))
# {'CCC': 2, 'GGT': 3}

1 Comment

Counter(candidates).most_common()
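
To illustrate what this comment suggests, here is a minimal sketch of most_common() called without an argument: it returns every (element, count) pair, ordered from the most frequent to the least. The candidates list below is a made-up sample, not the full list from the answer.

```python
from collections import Counter

# Hypothetical sample of 3-letter candidates
candidates = ["GGT", "CCC", "GGT", "CCC", "GGT", "ATG"]

# With no argument, most_common() returns all pairs sorted by count
ranked = Counter(candidates).most_common()
print(ranked)  # [('GGT', 3), ('CCC', 2), ('ATG', 1)]
```

Unlike the dict built in the answer, this list also includes substrings that occur only once, so a filter on the count is still needed if only repeats are wanted.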
