Given an excerpt of text like

Preface (optional, up to multiple lines)
Main : sequence1
   sequence2
   sequence3
   sequence4
Epilogue (optional, up to multiple lines)

which Java regular expression could be used to extract all the sequences (i.e. sequence1, sequence2, sequence3, sequence4 above), for example in a Matcher.find() loop?

Each "sequence" is preceded by, and may also contain, zero or more whitespace characters (including tabs).

The following regex

(?m).*Main(?:[ |t]+:(?:[ |t]+(\S+)[\r\n])+

only yields the first sequence (sequence1).

  • Does it mean you need to get multiple matches of the non-whitespace chunks that have some horizontal whitespaces on the subsequent lines after Main :? Commented Dec 20, 2016 at 23:47
  • Use String p = "(?m)(?:\\G(?!\\A)[^\\S\r\n]+|^Main\\s*:\\s*)(\\S+)\r?\n?"; Commented Dec 20, 2016 at 23:53
  • One match per line. Your regex works, thanks and +1. Commented Dec 20, 2016 at 23:58

1 Answer

You may use the following regex:

(?m)(?:\G(?!\A)[^\S\r\n]+|^Main\s*:\s*)(\S+)\r?\n?

Details:

  • (?m) - multiline mode on
  • (?:\G(?!\A)[^\S\r\n]+|^Main\s*:\s*) - either of the two:
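    • \G(?!\A)[^\S\r\n]+ - the end of the previous successful match (\G(?!\A), where (?!\A) rules out the start of the string) and then 1+ horizontal whitespace characters ([^\S\r\n]+, i.e. any whitespace except CR and LF; [\s&&[^\r\n]]+ is equivalent, while [\p{Zs}\t]+ is a close substitute that covers Unicode space separators plus tab)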
    • | - or
    • ^Main\s*:\s* - start of a line, Main, 0+ whitespaces, :, 0+ whitespaces
  • (\S+) - Group 1 capturing 1+ non-whitespace symbols
  • \r?\n? - an optional CR and an optional LF.
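
To see what the (?!\A) guard contributes, here is a minimal sketch that runs the pattern with and without it on an input whose very first line is indented. The class name SequenceExtractor, the helper extract, the guard-free variant and the sample text are all made up for illustration; only the guarded pattern comes from the answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SequenceExtractor {
    // The pattern from the answer (CR/LF written as regex escapes here).
    static final Pattern WITH_GUARD = Pattern.compile(
            "(?m)(?:\\G(?!\\A)[^\\S\\r\\n]+|^Main\\s*:\\s*)(\\S+)\\r?\\n?");

    // Hypothetical variant without (?!\A), for comparison only.
    static final Pattern WITHOUT_GUARD = Pattern.compile(
            "(?m)(?:\\G[^\\S\\r\\n]+|^Main\\s*:\\s*)(\\S+)\\r?\\n?");

    // Collects Group 1 of every successive match.
    static List<String> extract(Pattern pattern, String text) {
        List<String> result = new ArrayList<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        // Assumed sample: indented lines both before and after the Main block.
        String text = "   indented preface\nMain : a\n  b\nEnd\n   trailing";
        System.out.println(extract(WITH_GUARD, text));
        System.out.println(extract(WITHOUT_GUARD, text));
    }
}
```

On the first find() call, \G matches at the start of the input, so without the guard the indented preface words would be picked up as if they were sequences. The indented line after End is skipped by both variants, because \G only matches where the previous match ended.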

See the Java code below:

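import java.util.regex.Matcher;
import java.util.regex.Pattern;

// \r and \n inside the string literal are real CR/LF characters; the regex engine matches them literally.
String p = "(?m)(?:\\G(?!\\A)[^\\S\r\n]+|^Main\\s*:\\s*)(\\S+)\r?\n?";
String s = "Preface (optional, up to multiple lines)...\nMain : sequence1\n   sequence2\n   sequence3\n   sequence4\nEpilogue (optional, up to multiple lines)";
Matcher m = Pattern.compile(p).matcher(s);
while (m.find()) {
    // Group 1 holds one sequence per match: sequence1 through sequence4
    System.out.println(m.group(1));
}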

Comments

It works, thanks. Ideally, I would like something without anchors (\G or \A), but still it does the job. Maybe a simpler version exists. :-)
With 1 regex pass, this is the only way.
It is definitely an elegant one. If no simpler version appears, it will also become the selected answer. Thanks again. :-)
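
For readers who, like the asker, would rather avoid \G, one possible alternative is to give up the single pass: first capture the whole Main block, then split it on whitespace. The sketch below is not from the answer. The class name TwoPassExtractor and the BLOCK pattern are hypothetical, and it assumes the first sequence sits on the Main line and that continuation lines are indented with spaces or tabs and carry no trailing whitespace.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TwoPassExtractor {
    // Pass 1: the Main line plus every directly following indented line.
    static final Pattern BLOCK = Pattern.compile(
            "(?m)^Main[ \\t]*:[ \\t]*(\\S+(?:\\r?\\n[ \\t]+\\S+)*)");

    static List<String> extract(String text) {
        List<String> result = new ArrayList<>();
        Matcher m = BLOCK.matcher(text);
        while (m.find()) {
            // Pass 2: split the captured block on runs of whitespace.
            result.addAll(Arrays.asList(m.group(1).split("\\s+")));
        }
        return result;
    }

    public static void main(String[] args) {
        String s = "Preface (optional, up to multiple lines)...\n"
                + "Main : sequence1\n   sequence2\n   sequence3\n   sequence4\n"
                + "Epilogue (optional, up to multiple lines)";
        System.out.println(extract(s));
    }
}
```

The trade-off is an extra split step per block in exchange for a pattern that uses no \G or \A.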
