string1 = "abcdbcdbcde"

I want to split string1 into three parts (the first and third parts can be empty strings):

first part: a

second part (repetitions of some string): bcdbcdbcd

third part: e

import re

string1 = "abcdbcdbcde"
m = re.match("(.*)(.+){2,}(.*)", string1)
print(m.groups()[0], m.groups()[1], m.groups()[2])

Of course, the code above doesn't work.

As far as I know, parentheses can be used either as a capturing group or to group a pattern so that a quantifier applies to it. How can I use parentheses for both purposes at the same time?

What I want:

m.groups()[0] = "a"
m.groups()[1] = "bcdbcdbcd"
m.groups()[2] = "e"
  • Should the second part be a repetition of the same string? Like bcd bcd or like ab ab ab ab? Commented May 31, 2019 at 6:31

4 Answers

3

If the second part should be a repetition of the same string, you could match the first and third parts as optional single characters. For the second part, you could use a capturing group and a backreference:

^.?(.+)\1+.?$

Regex demo

Or if you want all capturing groups:

^(.?)((.+)\3+)(.?)$
  • ^ Start of string
  • (.?) Group 1, optionally match any char
  • ( Group 2
    • (.+)\3+ Group 3, match any char 1+ times, followed by a backreference to group 3 repeated 1+ times
  • ) Close group 2
  • (.?) Group 4, optionally match any char
  • $ End of string
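A minimal Python sketch of how the four-group pattern above could be applied to the string from the question. The names REPEAT_RE and split_repeated are made up for illustration; group 3 only holds the repeated unit, so it is dropped from the result.

```python
import re

# Pattern from the answer: optional single char, repeated unit, optional single char.
REPEAT_RE = re.compile(r"^(.?)((.+)\3+)(.?)$")

def split_repeated(s):
    """Return (first, repeated, last), or None if the string does not fit."""
    m = REPEAT_RE.match(s)
    if m is None:
        return None
    # Group 3 is only the repeated unit (e.g. 'bcd'); it is not part of the result.
    first, repeated, _unit, last = m.groups()
    return first, repeated, last

print(split_repeated("abcdbcdbcde"))  # ('a', 'bcdbcdbcd', 'e')
print(split_repeated("bcdbcdbcd"))    # ('', 'bcdbcdbcd', '')
```

Because the outer parts are written as .?, this only fits strings whose first and third parts are at most one character long.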

Regex demo

1

My take on the problem:

import re

def match(s, unit):
    # Lazy prefix, then two or more copies of `unit` (escaped so it is matched literally), then the rest.
    m = re.match("(.*?)((?:" + re.escape(unit) + "){2,})(.*)$", s)
    return m.groups() if m else (None, None, None)

print(match("abcdbcdbcde", "bcd"))
print(match("bcdbcdbcd", "bcd"))
print(match("abcdbcdbcd", "bcd"))
print(match("bcdbcdbcde", "bcd"))
print(match("axxbcdbcdxxe", "bcd"))
print(match("axxbcdxxe", "bcd")) # only one bcd in the middle

Prints:

('a', 'bcdbcdbcd', 'e')
('', 'bcdbcdbcd', '')
('a', 'bcdbcdbcd', '')
('', 'bcdbcdbcd', 'e')
('axx', 'bcdbcd', 'xxe')
(None, None, None)

0

I think it is impossible to match your requirements exactly, as more capturing groups are needed (at least one extra group, so that the same string can be repeated with a backreference such as \1).

But you can try (\w+)((\w+)\3+)(\w+)

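It consists of four capturing groups: the first holds the leading part, the second the whole repeated run, the third the repeated unit (a helper only), and the fourth the trailing part. Note that \w+ requires at least one character, so the first and last parts cannot be empty, and because the leading \w+ is greedy, the first group may take in part of the repetition.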

Explanation:

\w+ - match one or more word characters

\3+ - match the string captured in the third capturing group, one or more times
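A short Python check of how this pattern groups the string from the question. The second pattern, with a lazy first group (\w+?), is an assumed variant and not part of the answer; it is included only to show the effect of greediness on the first group.

```python
import re

sample = "abcdbcdbcde"

# Pattern from the answer: the leading \w+ is greedy and takes as much as it can.
greedy = re.match(r"(\w+)((\w+)\3+)(\w+)", sample)
print(greedy.groups())  # ('abcd', 'bcdbcd', 'bcd', 'e')

# Assumed variant (not from the answer): a lazy first group keeps the prefix short.
lazy = re.match(r"(\w+?)((\w+)\3+)(\w+)", sample)
print(lazy.groups())    # ('a', 'bcdbcdbcd', 'bcd', 'e')
```

With the greedy form, the first group absorbs one copy of bcd, so only two repetitions remain in group 2.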

Demo

0

The following regex should work (caveat below):

^(.*?)((.+?)\3+)(.*)

Explanation:

^      # Start of string
(.*?)  # Match any number of characters, as few as possible, until...
(      # (Start capturing group #2)
 (.+?) # ... a string is matched (and captured in group #3)
 \3+   # that is repeated at least once.
)      # End of group #2
(.*)   # Match the rest of the string
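As a rough Python sketch of the pattern explained above (FIRST_REPEAT_RE and split_on_first_repeat are hypothetical names chosen for this example):

```python
import re

# Lazy prefix, then a lazily matched unit repeated at least once more, then the rest.
FIRST_REPEAT_RE = re.compile(r"^(.*?)((.+?)\3+)(.*)")

def split_on_first_repeat(s):
    """Return (prefix, repeated run, rest), or None if nothing repeats back to back."""
    m = FIRST_REPEAT_RE.match(s)
    if m is None:
        return None
    # Group 3 is the repeated unit itself; only groups 1, 2 and 4 are returned.
    return m.group(1), m.group(2), m.group(4)

print(split_on_first_repeat("abcdbcdbcde"))  # ('a', 'bcdbcdbcd', 'e')
print(split_on_first_repeat("xaabcdbcde"))   # ('x', 'aa', 'bcdbcde')
```

The second call shows a consequence of the lazy prefix: the earliest repeat in the string wins, even when a longer repeat follows later.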

Test it live on regex101.com.

Caveat: If the string is long and doesn't have any obvious repeats, this is going to have very bad performance characteristics (O(n!), I think), since the regex engine has to check each and every permutation of substrings. See catastrophic backtracking.
