0

Using Python, I am trying to find any sequence of characters in a string by specifying the length of this sequence.

For example, if we have the following variable, I want to extract any identical sequence of characters with a length of 5:

x = "jhg**11111**jjhgj**11111**klhhkjh111ljhjkh1111"

The result should be:

11111
11111

How can I do that?

  • Counter could be your friend. Commented Feb 19, 2019 at 14:14
  • You should use regex to match a repeated expression. This post should help: stackoverflow.com/a/1660739/7692562 Commented Feb 19, 2019 at 14:15
  • @user5173426, can you elaborate? Counter by itself doesn't tell you anything about consecutive runs of identical characters. Commented Feb 19, 2019 at 14:16
  • @user5173426 Counter is not useful here because the characters have to be adjacent; itertools.groupby could be used, though. Commented Feb 19, 2019 at 14:16
  • @user5173426 I think I misunderstood the OP. I think they mean "identify sequences of n identical characters", not "identify identical n-long sequences within the string". Commented Feb 19, 2019 at 14:28

6 Answers

3

itertools to the rescue :)

>>> import itertools
>>> val = 5
>>> x
'jhg**11111**jjhgj**11111**klhhkjh111ljhjkh1111'
>>> [y[0]*val for y in itertools.groupby(x) if len(list(y[1])) == val]
['11111', '11111']

Edit: with better names

>>> [char*val for char,grouper in itertools.groupby(x) if len(list(grouper)) == val]
['11111', '11111']

Or the more memory-efficient one-liner suggested by @Chris_Rands:

>>> [k*val for k, g in itertools.groupby(x) if sum(1 for _ in g) == val]
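
The groupby idea can also be wrapped in a small helper. This is only a sketch: the function name `runs_of_length` and the `exact` flag are hypothetical names chosen for illustration, not part of itertools.

```python
from itertools import groupby

def runs_of_length(text, n, exact=True):
    """Return every run of one repeated character whose length is n.

    With exact=False, runs longer than n are reported too (in full).
    """
    result = []
    for char, group in groupby(text):
        length = sum(1 for _ in group)  # count the run without building a list
        if length == n or (not exact and length > n):
            result.append(char * length)
    return result

x = "jhg**11111**jjhgj**11111**klhhkjh111ljhjkh1111"
print(runs_of_length(x, 5))                            # ['11111', '11111']
print(runs_of_length("aaaaaa bbbbb", 5))               # ['bbbbb']
print(runs_of_length("aaaaaa bbbbb", 5, exact=False))  # ['aaaaaa', 'bbbbb']
```

Because groupby yields maximal runs, a run of six characters is not reported when exactly five are requested.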

2

Or, if you are fine with using regex, it makes your code a lot cleaner:

import re
[row[0] for row in re.findall(r'((.)\2{4,})', x)]

regex101 - example

4 Comments

This pattern does indeed match the sequence that OP is looking for. But search only finds the first instance. Is it possible to find all instances?
@hansolo, that works for the OP's sample input, but I think that he also wants sequences that don't contain the character "1". For example, "22222 foo QQQQQ" should return ["22222", "QQQQQ"]
@Kevin Then something like ', '.join(y*5 for y in re.findall(r'(.)\1{4}', x))
Looking good, now :-) I was hoping there would be a findall-based solution that captures only and exactly the full sequences, so that no list comp would be required. But I don't think you can match the sequence without capturing the first character by itself.
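
For what it's worth, one way to get whole runs of exactly n characters with regex is to match every maximal run and then filter by length. A minimal sketch, assuming that is the behaviour wanted; the helper name `exact_runs` is hypothetical:

```python
import re

def exact_runs(text, n):
    # (.)\1* matches one character plus all of its immediate repeats,
    # i.e. a maximal run; re.DOTALL lets '.' match newlines as well.
    pattern = re.compile(r'(.)\1*', re.DOTALL)
    return [m.group() for m in pattern.finditer(text) if len(m.group()) == n]

print(exact_runs("22222 foo QQQQQ", 5))  # ['22222', 'QQQQQ']
print(exact_runs("111111 22222", 5))     # ['22222']
```

This still needs a list comprehension, but m.group() returns the full run, so nothing has to be rebuilt from the captured character.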
1

The original answer (below) is for a different problem (identifying repeated patterns of n characters in the string). Here is one possible one-liner to solve the actual problem:

x = "jhg**11111**jjhgj**11111**klhhkjh111ljhjkh1111"
n = 5
res = [x[i:i + n] for i, c in enumerate(x) if x[i:i + n] == c * n]
print(res)
# ['11111', '11111']
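
Note that the slice comparison is tried at every start index, so a run longer than n produces overlapping hits. A small sketch of that behaviour (the function name `windows` is made up for illustration):

```python
def windows(text, n):
    # Same comprehension as the one-liner, wrapped in a function.
    return [text[i:i + n] for i, c in enumerate(text) if text[i:i + n] == c * n]

# A single run of six '1' characters yields two overlapping matches.
print(windows("ab111111cd", 5))  # ['11111', '11111']
```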

Original (wrong) answer

Using Counter:

from collections import Counter

x = "jhg**11111**jjhgj**11111**klhhkjh111ljhjkh1111"
n = 5
c = Counter(x[i:i + n] for i in range(len(x) - n + 1))
for k, v in c.items():
    if v > 1:
        print(*([k] * v), sep='\n')

Output:

**111
**111
*1111
*1111
11111
11111
1111*
1111*
111**
111**

1 Comment

Although it's for a different problem, I liked this. +1
1

Very ugly solution :-)

x = "jhg**11111**jjhgj**11111**klhhkjh22222jhjkh1111"
for c, i in enumerate(x):
    if i == x[c+1:c+2] and i == x[c+2:c+3] and i == x[c+3:c+4] and i == x[c+4:c+5]:
        print(x[c:c+5])

2 Comments

Style tip: consider using for c, i in enumerate(x): instead of manually incrementing a count variable.
Thanks. I edited my code. Still ugly, but should work :-)
0

Try this:

x = "jhg**11111**jjhgj**11111**klhhkjh111ljhjkh1111"

seq_length = 5

for item in set(x):
    if seq_length*item in x:
        for i in range(x.count(seq_length*item)):
            print(seq_length*item)

It works by using set() to get each distinct character, building the sequence you're looking for from it, and then searching for that sequence in the text.

It prints your desired output:

11111
11111


0

Let's change your source string a little:

x = "jhg**11111**jjhgj**22222**klhhkjh33333jhjkh44444"

The regex should be:

pat = r'(.)\1{4}'

Here you have a capturing group (a single char) and a backreference to it (repeated 4 times), so in total the same char must occur 5 times in a row.

One variant to print the result, although a less intuitive one, is:

import re
res = re.findall(pat, x)
print(res)

But the above code prints:

['1', '2', '3', '4']

i.e. a list in which each element is only the capturing group (in our case the first char), not the whole match.

So I also propose a second variant, which uses finditer and prints both the start position and the whole match:

for match in re.finditer(pat, x):
    print('{:2d}: {}'.format(match.start(), match.group()))

For the above data the result is:

 5: 11111
19: 22222
33: 33333
43: 44444
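
As a side note on this pattern: it matches any five equal characters in a row, so a longer run is split into consecutive five-character matches. A short sketch on a made-up input string:

```python
import re

pat = r'(.)\1{4}'
# Ten '1' characters starting at index 2: the pattern matches twice.
matches = [(m.start(), m.group()) for m in re.finditer(pat, "ab1111111111cd")]
print(matches)  # [(2, '11111'), (7, '11111')]
```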
