Python : Regex, Finding Repetitions on a string

Question

I need to find repetitions in a text string. I already found a very nice, elegant solution here from @Tim Pietzcker.

I am happy with the solution as is, but I would like to know whether it can be extended a little further so that it accepts a string containing whitespace.

For example "a bcab c" would return [(abc,2)]

I tried using the regex pattern "([^\s]+?)\1+" with no luck. Any help is much appreciated.

If you are in Python, you could simply do no_whitespaces = input_str.replace(" ", "") and then run your regex on no_whitespaces.
– e.s., Commented Mar 21, 2019 at 1:15
Hi e.s, that is one possibility, but my application needs to find the patterns in a bigger text structure. Whenever possible I would like to keep the spaces, because I am planning to highlight the found text once the match is made.
– XYZ, Commented Mar 21, 2019 at 2:54
If you want to highlight the found text once the match is made, then for your example above shouldn't the output be [(a bc,2)]? If not, how are you going to highlight the text once the match is made?
– sanooj, Commented Mar 21, 2019 at 5:00

sanooj · Accepted Answer · 2019-03-21 03:53:42Z

1

Consider removing the whitespace from the text first. You can do that with a regex as well, using re.sub.

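>>> import re
>>> def repetitions(s):
...    r = re.compile(r"(.+?)\1+")
...    # strip all whitespace, then look for a unit that repeats back to back
...    for match in r.finditer(re.sub(r'\s+',"",s)):
...        # integer division so the count is an int on Python 3 as well
...        yield (match.group(1), len(match.group(0))//len(match.group(1)))
...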

Output.

>>> list(repetitions("a bcab c"))
[('abc', 2)]

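If you still want to retain the spaces in the original text, try this regex: r"(\s*\S+\s*?\S*?)\1+". It has limitations, though: the backreference \1 must match the captured text exactly, so the whitespace has to fall at the same positions in every repetition, and the repeated unit can contain at most one internal run of whitespace.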

>>> import re
>>> def repetitions(s):
...    r = re.compile(r"(\s*\S+\s*?\S*?)\1+")
...    for match in r.finditer(s):
...        # integer division so the count is an int on Python 3 as well
...        yield (match.group(1), len(match.group(0))//len(match.group(1)))
...

Results:

>>> list(repetitions(" abc abc "))
[(' abc', 2)]
>>> list(repetitions("abc abc "))
[('abc ', 2)]
>>> list(repetitions(" ab c ab c "))
[(' ab c', 2)]
>>> list(repetitions("ab cab c "))
[('ab c', 2)]
>>> list(repetitions("blablabla"))
[('bla', 3)]
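
To make the limitation concrete, here is a small sketch that runs the same whitespace-preserving pattern on the string from the question. It reuses the repetitions function from this answer (with integer division for the count); nothing else is assumed. Because "a bcab c" contains no two identical adjacent chunks once the spaces are counted, the backreference has nothing to match.

```python
import re

def repetitions(s):
    # Whitespace-preserving pattern from the answer above
    r = re.compile(r"(\s*\S+\s*?\S*?)\1+")
    for match in r.finditer(s):
        yield (match.group(1), len(match.group(0)) // len(match.group(1)))

# Spaces sit at the same offset in each copy: the repeat is found
print(list(repetitions("ab cab c ")))  # [('ab c', 2)]

# Spaces sit at different offsets in each copy: nothing is found
print(list(repetitions("a bcab c")))   # []
```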

edited Mar 21, 2019 at 3:53

answered Mar 21, 2019 at 2:54

sanooj


1 Comment

XYZ Over a year ago

Thanks Sanooj. I ended up removing the spaces first and then matching each found group back against the original text with a newly compiled regex that allows whitespace between the characters. For example, the match "abc" is fed into a new regex built with "\s*".join('abc'). Thanks heaps for your time again.
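
The two-step approach described in this comment could be sketched roughly as below. The function name find_repetitions_with_spans and the cursor bookkeeping are assumptions made for illustration; they are not code from the thread. The idea is to detect repeats on a whitespace-free copy, then build a pattern with \s* between every character of the repeated run and search the original text with it, which yields a span that can be used for highlighting.

```python
import re

def find_repetitions_with_spans(text):
    """Yield (unit, count, (start, end)) for each repeated run in text.

    Repeats are detected on a whitespace-free copy, then located in the
    original text with a pattern that tolerates whitespace between characters.
    """
    stripped = re.sub(r"\s+", "", text)
    cursor = 0  # resume each search where the previous run ended
    for match in re.finditer(r"(.+?)\1+", stripped):
        unit = match.group(1)
        count = len(match.group(0)) // len(unit)
        # e.g. "abcabc" becomes r"a\s*b\s*c\s*a\s*b\s*c"
        tolerant = r"\s*".join(re.escape(ch) for ch in unit * count)
        located = re.compile(tolerant).search(text, cursor)
        if located:
            cursor = located.end()
            yield unit, count, located.span()

text = "a bcab c"
for unit, count, (start, end) in find_repetitions_with_spans(text):
    print(unit, count, repr(text[start:end]))  # abc 2 'a bcab c'
```

The returned span covers the whole repeated run in the original string, spaces included. This is only a sketch: it has not been checked against every way separate runs can interleave in a longer text.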

Fatih Aktaş · Accepted Answer · 2019-03-21 01:58:04Z

0

Using (\S+ ?\S?)\1, you can make the pattern tolerant of spaces for strings like the one below, where the space sits at the same position in each copy of the repeated word ab c.

ab cab c

However, if the spaces are not at the same positions in each copy of the repeated word, you have to replace the meaningless spaces with an empty string "" before you can find the repeated words with your approach.
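
As a quick check of both cases, the sketch below applies the pattern from this answer to the example string and to a variant with the spaces moved. The string "ab ca bc" is a made-up input for illustration, not one taken from the question.

```python
import re

pattern = re.compile(r"(\S+ ?\S?)\1")

# Space at the same position in both copies: the pattern matches
regular = pattern.search("ab cab c")
print(regular.group(1), regular.span())  # ab c (0, 8)

# Space at a different position in each copy: no match
irregular = pattern.search("ab ca bc")
print(irregular)  # None
```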

answered Mar 21, 2019 at 1:58

Fatih Aktaş


1 Comment

XYZ Over a year ago

Hi Fatih, thanks heaps for your input, but the spaces are irregular, as shown in my example.
