Finding a sequence of characters in string

Question

Using Python, I am trying to find every run of identical characters in a string, given the length of the run.

For example, given the following variable, I want to extract every run of the same character with a length of 5:

x = "jhg**11111**jjhgj**11111**klhhkjh111ljhjkh1111"

The result should be:

11111
11111

How can I do that?

You should use regex to match a repeated expression. This post should help: stackoverflow.com/a/1660739/7692562
– foobarbaz, Commented Feb 19, 2019 at 14:15
@user5173426, can you elaborate? Counter by itself doesn't tell you anything about consecutive runs of identical characters.
– Kevin, Commented Feb 19, 2019 at 14:16
@user5173426 Counter is not useful here because the characters have to be adjacent; itertools.groupby could be used, though.
– Chris_Rands, Commented Feb 19, 2019 at 14:16
@user5173426 I think I misunderstood the OP. I think they mean "identify sequences of n identical characters", not "identify identical n-long sequences within the string".
– javidcf, Commented Feb 19, 2019 at 14:28

han solo · Accepted Answer · 2019-02-19 14:39:03Z

3

itertools to the rescue :)

>>> import itertools
>>> val = 5
>>> x
'jhg**11111**jjhgj**11111**klhhkjh111ljhjkh1111'
>>> [y[0]*val for y in itertools.groupby(x) if len(list(y[1])) == val]
['11111', '11111']

Edit: with better names

>>> [char*val for char,grouper in itertools.groupby(x) if len(list(grouper)) == val]
['11111', '11111']

Or the more memory-efficient one-liner suggested by @Chris_Rands:

>>> [k*val for k, g in itertools.groupby(x) if sum(1 for _ in g) == val]
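For reuse, the groupby idea above can be wrapped in a function. This is only a sketch: the name runs_of_length is made up for illustration, and it assumes "length of 5" means a run of exactly 5, so a longer run such as six identical characters is skipped.

```python
from itertools import groupby


def runs_of_length(text, n):
    """Return each maximal run of one repeated character whose length is exactly n.

    Hypothetical helper name, not from the original answer.
    """
    result = []
    for char, group in groupby(text):
        # Count the group's items without building an intermediate list.
        length = sum(1 for _ in group)
        if length == n:
            result.append(char * n)
    return result


print(runs_of_length("22222 foo QQQQQ", 5))  # ['22222', 'QQQQQ']
```

Because groupby yields maximal runs, the comparison `length == n` can be changed to `length >= n` if longer runs should count as well.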

edited Feb 19, 2019 at 14:39

answered Feb 19, 2019 at 14:17

han solo


RnD · Accepted Answer · 2019-02-19 14:40:08Z

2

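Or, if you are fine with using regex, this makes your code a lot cleaner:

import re
# x is the string from the question; note that {4,} also matches runs longer than 5
[row[0] for row in re.findall(r'((.)\2{4,})', x)]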

regex101 - example
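One thing worth checking with this pattern is what happens to runs longer than 5. The sketch below contrasts "at least n" with "exactly n"; both function names are invented for illustration, and the exact-length variant swaps in a different technique (match every maximal run with `(.)\1*`, then filter by length).

```python
import re


def runs_at_least(text, n):
    """Return every maximal run of one repeated character with length >= n."""
    pattern = re.compile(r'((.)\2{%d,})' % (n - 1), re.DOTALL)
    # findall returns (whole_run, single_char) tuples because of the two groups.
    return [whole for whole, _char in pattern.findall(text)]


def runs_exactly(text, n):
    """Return only the maximal runs whose length is exactly n."""
    return [m.group(0)
            for m in re.finditer(r'(.)\1*', text, re.DOTALL)
            if len(m.group(0)) == n]


sample = "aa111111bb22222"
print(runs_at_least(sample, 5))  # ['111111', '22222']
print(runs_exactly(sample, 5))   # ['22222']
```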

edited Feb 19, 2019 at 14:40

answered Feb 19, 2019 at 14:21

RnD


4 Comments

Kevin Over a year ago

This pattern does indeed match the sequence that OP is looking for. But search only finds the first instance. Is it possible to find all instances?

Kevin Over a year ago

@hansolo, that works for the OP's sample input, but I think that he also wants sequences that don't contain the character "1". For example, "22222 foo QQQQQ" should return ["22222", "QQQQQ"]

han solo Over a year ago

@Kevin Then something like ', '.join(y*5 for y in re.findall(r'(.)\1{4}', x))

Kevin Over a year ago

Looking good, now :-) I was hoping there would be a findall-based solution that captures only and exactly the full sequences, so that no list comp would be required. But I don't think you can match the sequence without capturing the first character by itself.

javidcf · Accepted Answer · 2019-02-19 14:44:31Z

1

The original answer (below) is for a different problem (identifying repeated patterns of n characters in the string). Here is one possible one-liner to solve the actual problem:

x = "jhg**11111**jjhgj**11111**klhhkjh111ljhjkh1111"
n = 5
res = [x[i:i + n] for i, c in enumerate(x) if x[i:i + n] == c * n]
print(res)
# ['11111', '11111']
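Note that the slicing approach checks a window at every index, so a run longer than n is reported more than once. The following sketch shows that, plus a variant that jumps past each hit. The function names are hypothetical, and whether overlapping hits are wanted is an assumption about the OP's intent.

```python
def sliding_matches(text, n):
    """Same idea as the one-liner: test the n-character window at every index."""
    return [text[i:i + n] for i, c in enumerate(text) if text[i:i + n] == c * n]


def non_overlapping_matches(text, n):
    """Like sliding_matches, but skip ahead by n after each hit."""
    if n < 1:
        raise ValueError("n must be at least 1")
    result = []
    i = 0
    while i <= len(text) - n:
        if text[i:i + n] == text[i] * n:
            result.append(text[i:i + n])
            i += n  # jump past the run that was just reported
        else:
            i += 1
    return result


print(sliding_matches("xaaaaaay", 5))          # ['aaaaa', 'aaaaa']
print(non_overlapping_matches("xaaaaaay", 5))  # ['aaaaa']
```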

Original (wrong) answer

Using Counter:

from collections import Counter

x = "jhg**11111**jjhgj**11111**klhhkjh111ljhjkh1111"
n = 5
c = Counter(x[i:i + n] for i in range(len(x) - n + 1))
for k, v in c.items():
    if v > 1:
        print(*([k] * v), sep='\n')

Output:

**111
**111
*1111
*1111
11111
11111
1111*
1111*
111**
111**

edited Feb 19, 2019 at 14:44

answered Feb 19, 2019 at 14:18

javidcf


1 Comment

DirtyBit Over a year ago

Although it's for a different problem, I liked this. +1

Xenobiologist · Accepted Answer · 2019-02-19 14:59:51Z

1

Very ugly solution :-)

x = "jhg**11111**jjhgj**11111**klhhkjh22222jhjkh1111"
for c, i in enumerate(x):
    if i == x[c+1:c+2] and i == x[c+2:c+3] and i == x[c+3:c+4] and i == x[c+4:c+5]:
        print(x[c:c+5])
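The chain of four comparisons above hard-codes a length of 5. A possible generalization, sketched with all() and a made-up function name and parameter n, looks like this:

```python
def runs_by_comparison(text, n=5):
    """Return every n-character window made of a single repeated character."""
    found = []
    for idx, ch in enumerate(text):
        window = text[idx:idx + n]
        # A short window near the end of the string can never qualify.
        if len(window) == n and all(other == ch for other in window):
            found.append(window)
    return found


x = "jhg**11111**jjhgj**11111**klhhkjh22222jhjkh1111"
print(runs_by_comparison(x))  # ['11111', '11111', '22222']
```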

edited Feb 19, 2019 at 14:59

answered Feb 19, 2019 at 14:43

Xenobiologist


2 Comments

Kevin Over a year ago

Style tip: consider using for c, i in enumerate(x): instead of manually incrementing a count variable.

Xenobiologist Over a year ago

Thanks. I edited my code. Still ugly, but should work :-)

vencaslac · Accepted Answer · 2019-02-19 14:19:51Z

0

Try this:

x = "jhg**11111**jjhgj**11111**klhhkjh111ljhjkh1111"

seq_length = 5

for item in set(x):
    if seq_length*item in x:
        for i in range(x.count(seq_length*item)):
            print(seq_length*item)

It works by using set() to get each distinct character in the text, building the sequence you're looking for from that character, and then searching for the sequence in the text.

This prints your desired output:

11111
11111
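Two details of this approach are easy to miss: iterating over a set gives no guaranteed order, and str.count counts non-overlapping occurrences, so a run of 10 is counted as two runs of 5. A sketch that returns a list in a stable order (the function name is made up, and sorting by character is just one arbitrary choice of order):

```python
def runs_via_count(text, n):
    """Return ch * n once per non-overlapping occurrence, ordered by character."""
    found = []
    for ch in sorted(set(text)):  # sorted() makes the output order deterministic
        target = ch * n
        found.extend([target] * text.count(target))
    return found


print(runs_via_count("22222 foo QQQQQ", 5))  # ['22222', 'QQQQQ']
```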

answered Feb 19, 2019 at 14:19

vencaslac


Valdi_Bo · Accepted Answer · 2019-02-19 14:53:43Z

Let's change your source string a little:

x = "jhg**11111**jjhgj**22222**klhhkjh33333jhjkh44444"

The regex should be:

pat = r'(.)\1{4}'

Here you have a capturing group (a single char) followed by a backreference to it repeated 4 times, so in total the same char must occur 5 times in a row.
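The 4 in the pattern is simply the run length minus one, so the pattern can be built for any length. A minimal sketch, where build_run_pattern is a hypothetical helper and re.DOTALL is an added assumption so that runs of newlines are matched too:

```python
import re


def build_run_pattern(n):
    """Compile a pattern matching one character followed by n - 1 copies of it."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # re.DOTALL lets '.' match a newline as well.
    return re.compile(r'(.)\1{%d}' % (n - 1), re.DOTALL)


pat5 = build_run_pattern(5)
match = pat5.search("jhkjh33333jhjkh")
print(match.start(), match.group())  # 5 33333
```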

One way to print the result, although a less intuitive one, is:

import re

res = re.findall(pat, x)
print(res)

But the above code prints:

['1', '2', '3', '4']

i.e. a list in which each element is only the capturing group (in our case the first char), not the whole match.

So I also propose a second variant, using finditer and printing both the start position and the whole match:

for match in re.finditer(pat, x):
    print('{:2d}: {}'.format(match.start(), match.group()))

For the above data the result is:

 5: 11111
19: 22222
33: 33333
43: 44444
