
I need to find repetitions in a text string. I already found a very nice, elegant solution here from @Tim Pietzcker.

I am happy with the solution as is, but I would like to know whether it can be extended a little further so that it accepts a string containing whitespace.

For example, "a bcab c" would return [('abc', 2)].

I tried using the regex pattern "([^\s]+?)\1+" with no luck. Any help is much appreciated.

  • If you're using Python, you could simply do no_whitespaces = input_str.replace(" ", "") and then run your regex on no_whitespaces. Commented Mar 21, 2019 at 1:15
  • Hi e.s, that is one possibility, but my application needs to find the patterns in a bigger text structure. Whenever possible I would like to keep the spaces, because I am planning to highlight the found text once the match is made. Commented Mar 21, 2019 at 2:54
  • If you want to highlight the found text once the match is made, shouldn't the output for your example above be [(a bc,2)]? If not, how are you going to highlight the text once the match is made? Commented Mar 21, 2019 at 5:00

2 Answers


Consider removing the whitespace from the text first. You can do that with a regex as well.

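>>> import re
>>> def repetitions(s):
...    r = re.compile(r"(.+?)\1+")
...    # Strip all whitespace before searching for repeats
...    for match in r.finditer(re.sub(r'\s+', "", s)):
...        # Integer division keeps the repeat count an int on Python 3
...        yield (match.group(1), len(match.group(0)) // len(match.group(1)))
... 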

Output:

>>> list(repetitions("a bcab c"))
[('abc', 2)]

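If you still want to retain the spaces in the original text, try this regex: r"(\s*\S+\s*?\S*?)\1+". It has limitations, though: it only matches when the whitespace falls at the same positions in every repetition, and the repeated unit can contain at most one internal run of whitespace.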

>>> def repetitions(s):
...    r = re.compile(r"(\s*\S+\s*?\S*?)\1+")
...    for match in r.finditer(s):
...        # Integer division keeps the repeat count an int on Python 3
...        yield (match.group(1), len(match.group(0)) // len(match.group(1)))
... 

Results:

>>> list(repetitions(" abc abc "))
[(' abc', 2)]
>>> list(repetitions("abc abc "))
[('abc ', 2)]
>>> list(repetitions(" ab c ab c "))
[(' ab c', 2)]
>>> list(repetitions("ab cab c "))
[('ab c', 2)]
>>> list(repetitions("blablabla"))
[('bla', 3)]

1 Comment

Thanks Sanooj, I ended up removing the spaces and then matching the group back against the original text with a newly compiled regex that allows spaces. For example, the match "abc" is fed into a new regex built with "\s*".join('abc'). Thanks heaps for your time again.
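The approach described in this comment could look roughly like the sketch below. This is an assumption about how the pieces fit together, not the commenter's actual code: the function name repetitions_with_spans and the cursor variable are made up for illustration, and the whole repeated run (not just one unit) is rebuilt with "\s*" between characters so that the returned span covers everything to highlight.

```python
import re

def repetitions_with_spans(text):
    """Yield (unit, count, (start, end)); spans index the ORIGINAL text."""
    # Step 1: find repeats on a whitespace-free copy of the text.
    stripped = re.sub(r"\s+", "", text)
    cursor = 0  # resume each search after the previous hit
    for m in re.finditer(r"(.+?)\1+", stripped):
        unit = m.group(1)
        count = len(m.group(0)) // len(unit)
        # Step 2: rebuild the repeated run, allowing optional whitespace
        # between every pair of characters, and locate it in the original.
        loose = re.compile(r"\s*".join(re.escape(ch) for ch in unit * count))
        hit = loose.search(text, cursor)
        if hit is None:
            continue
        cursor = hit.end()
        yield unit, count, hit.span()

print(list(repetitions_with_spans("a bcab c")))
# [('abc', 2, (0, 8))]
```

The span can then be used to slice or highlight the original string, irregular spaces included. re.escape is there so that characters such as "." or "(" in the matched unit are treated literally in the rebuilt pattern.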

Using (\S+ ?\S?)\1, you can make the pattern tolerant of spaces for strings like the one below, where the space sits at the same position in each repetition of the repeated unit ab c.

ab cab c 

However, if the spaces are not at the same positions in each repetition, you have to replace the meaningless spaces with an empty string "" before your approach can find the repeated words.
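A quick way to check both cases is sketched below. The variable names fixed_spacing and collapsed are only illustrative; the second half uses the (.+?)\1+ pattern from the question's original solution on the whitespace-free string.

```python
import re

# Pattern from this answer: the space must sit at the same place in each repeat.
fixed_spacing = re.compile(r"(\S+ ?\S?)\1")

# Regular spacing: "ab c" followed by "ab c" is found.
regular = fixed_spacing.search("ab cab c")
print(regular.group(1))                    # ab c

# Irregular spacing (the question's example): no match at all.
print(fixed_spacing.search("a bcab c"))    # None

# Fallback: drop the whitespace first, then look for repeats.
collapsed = re.sub(r"\s+", "", "a bcab c")
print(re.search(r"(.+?)\1+", collapsed).group(1))   # abc
```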

1 Comment

Hi Faith, thanks heaps for your input, but the spaces are irregular, as shown in my example.
