
I have a character string 'aabaacaba'. Scanning from the left, I am trying to find every substring of length >= 2 that appears again later in the string. For instance, aa appears again later in the string, and so does ab.

I wrote the following regex code:

import re
re.findall(r'([a-z]{2,})(?:[a-z]*)(?:\1)', 'aabaacaba')

and I get ['aa'] as the answer. The regular expression misses the ab pattern. I think this is because of overlapping characters. Please suggest how the expression could be fixed. Thank you.

  • Does this have to be done with regex? Commented May 14, 2017 at 2:34
  • @Chris Not necessarily. But it would be great if it could be done with regex. Commented May 14, 2017 at 2:44

1 Answer

You can use a look-ahead assertion, which does not consume the matched string:

>>> import re
>>> re.findall(r'(?=([a-z]{2,})(?=.*\1))', 'aabaacaba')
['aa', 'aba', 'ba']

NOTE: aba is matched instead of ab, because the {2,} quantifier is greedy and captures the longest substring at each position that still repeats later.
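If the shorter substrings such as ab are also needed, one option is to skip the regex and test every candidate substring directly. The following is only a rough sketch, not part of the original answer: the function name repeated_substrings and the min_len parameter are made up for illustration, and it assumes that "appears later" means a second occurrence starting at or after the end of the first one, which is the same condition the look-ahead pattern checks.

```python
def repeated_substrings(s, min_len=2):
    """Hypothetical helper: substrings of length >= min_len that occur
    again later in s, listed in order of first appearance."""
    found = []
    for start in range(len(s)):
        for end in range(start + min_len, len(s) + 1):
            piece = s[start:end]
            # Only accept a second occurrence that begins at or after `end`,
            # i.e. one that does not overlap the first occurrence.
            if s.find(piece, end) != -1 and piece not in found:
                found.append(piece)
    return found


print(repeated_substrings('aabaacaba'))  # ['aa', 'ab', 'aba', 'ba']
```

This is quadratic in the number of candidate substrings, so it is best suited to short inputs like the one in the question.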


Comments

Can [a-z] be replaced with \w, as in (?=(\w{2,})(?=.*\1))?
@SebastiánPalma, Yes, it can. But it will also match digits and _. I'm not sure whether that's what the OP wants, so I left it as the OP wrote it. Maybe . is more appropriate if the OP wants any character.
@SebastiánPalma, I couldn't use a look-behind assertion, because Python's re allows only fixed-length look-behind assertions.
@Sumit, Without the first look-ahead assertion, the first matched part is consumed, so overlapping matches (aba in this case) are excluded from the result.
@falsetru Great answer. I hadn't thought of the first look-ahead assertion. Learned something new today :)
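The point about consumption in the comments above can be checked by running the pattern with and without the outer look-ahead. This is a small illustrative sketch; the variable names consuming and non_consuming are chosen here and do not come from the thread.

```python
import re

s = 'aabaacaba'

# Capture group only: each match consumes its characters, so the scan
# resumes after the match and overlapping candidates are skipped.
consuming = re.findall(r'([a-z]{2,})(?=.*\1)', s)

# Wrapped in an outer look-ahead: the match itself is empty, so the scan
# advances one character at a time and overlapping candidates are reported.
non_consuming = re.findall(r'(?=([a-z]{2,})(?=.*\1))', s)

print(consuming)      # ['aa', 'ba']
print(non_consuming)  # ['aa', 'aba', 'ba']
```

In the first call, aa consumes positions 0 and 1, so the search resumes at b and never tries the candidate that starts at position 1, which is why aba is missing.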