
I'm trying to build a regular expression that captures multiple groups, some of which overlap with others. For instance, let's say I want to capture every 4-gram that follows the word 'to':

input = "I want to run to get back on shape"
expectedOutput = ["run to get back", "get back on shape"]

In that case I would use this regex:

"to((?:[ ][a-zA-Z]+){4})"

But it only captures the first item in expectedOutput (with a leading space, but that's not the point). This is quite easy to solve without regex, but I'd like to know whether it is possible using only a regex.
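To make the problem reproducible, here is a minimal sketch that runs the pattern above in Java (the language is an assumption taken from the answer below; the class and method names are hypothetical). The first match consumes the text up to "back", including the second "to", so the search never gets a chance to start a match there.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ConsumingMatchDemo {
    // Collects group 1 of every match of the question's original pattern.
    static List<String> findAll(String input) {
        Pattern pattern = Pattern.compile("to((?:[ ][a-zA-Z]+){4})");
        Matcher matcher = pattern.matcher(input);
        List<String> results = new ArrayList<>();
        while (matcher.find()) {
            // Each find() resumes after the end of the previous match,
            // so text already consumed cannot start a new match.
            results.add(matcher.group(1));
        }
        return results;
    }

    public static void main(String[] args) {
        // Only one 4-gram is reported for this input.
        System.out.println(findAll("I want to run to get back on shape"));
    }
}
```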

2 Answers

1

You can use a lookahead to get overlapping matches:

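import java.util.regex.Matcher;
import java.util.regex.Pattern;

String s = "I want to run to get back on shape";
// The lookahead consumes no characters, so a later "to" can still start its own match.
Pattern pattern = Pattern.compile("(?=\\bto\\b((?:\\s*[\\p{L}\\p{M}]+){4}))");
Matcher matcher = pattern.matcher(s);
while (matcher.find()) {
    // Group 1 includes the leading whitespace, hence the trim().
    System.out.println(matcher.group(1).trim());
}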

See IDEONE demo

The regex (?=\bto\b((?:\s*[\p{L}\p{M}]+){4})) is a zero-width lookahead, so it is tested at each position in the string without consuming any characters, which is what lets the captured texts overlap. At each position it looks for:

  • \bto\b - a whole word to
  • ((?:\s*[\p{L}\p{M}]+){4}) - Group 1 capturing 4 occurrences of
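    • \s* - zero or more whitespace characters (use \s+ to require a separator between words; with \s*, a single long word can be split into pieces to reach the count)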
    • [\p{L}\p{M}]+ - one or more letters or diacritics

If you want to allow n-grams of fewer than 4 words, use a greedy bounded quantifier, {0,4} (or {1,4} to require at least one word), instead of {4}.
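As a rough sketch of that variant, the helper below builds the lookahead with a {1,max} quantifier and returns the trimmed captures. The class name, method name and maxWords parameter are hypothetical, and the sketch swaps \s* for \s+ so that words must be separated by whitespace, which is a deviation from the pattern shown above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NgramAfterTo {
    // Returns, for every whole word "to", the next 1..maxWords words.
    static List<String> ngramsAfterTo(String input, int maxWords) {
        // Assumption: \s+ (not \s*) so one word cannot be split into several.
        String regex = "(?=\\bto\\b((?:\\s+[\\p{L}\\p{M}]+){1," + maxWords + "}))";
        Matcher matcher = Pattern.compile(regex).matcher(input);
        List<String> results = new ArrayList<>();
        while (matcher.find()) {
            // The match itself is empty; the text of interest is in group 1.
            results.add(matcher.group(1).trim());
        }
        return results;
    }

    public static void main(String[] args) {
        System.out.println(ngramsAfterTo("I want to run to get back on shape", 4));
        System.out.println(ngramsAfterTo("Time to go home", 4));
    }
}
```

With the sample sentence this yields the two overlapping 4-grams from the question, while a shorter input such as "Time to go home" yields the two remaining words instead of no match at all.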


1 Comment

Thanks a lot, that's exactly what I was looking for. I wasn't yet familiar with the zero-width positive lookahead (?=...).
0

Groups in a regex are numbered by the order of their opening parentheses, from left to right, so nested groups get their own numbers:

1       ((A)(B(C)))   // first group (encloses the other three)
2       (A)           // second group
3       (B(C))        // third group (encloses the fourth)
4       (C)           // fourth group
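The numbering can be checked with a small sketch in Java, assuming the letters A, B and C stand for literal characters (the class and method names are hypothetical):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupOrderDemo {
    // Returns the text of groups 1..groupCount() when the whole input matches.
    static List<String> groups(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        List<String> captured = new ArrayList<>();
        if (matcher.matches()) {
            for (int i = 1; i <= matcher.groupCount(); i++) {
                captured.add(matcher.group(i));
            }
        }
        return captured;
    }

    public static void main(String[] args) {
        // Prints [ABC, A, BC, C]
        System.out.println(groups("((A)(B(C)))", "ABC"));
    }
}
```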
