
I need to extract content from a file that can contain multiple occurrences of a pattern. The file consists of multiple sections, and I want to extract each section. The extracted content should include the string that matches the pattern.

Eg: File content

01
Community based Index1- 
...some text....
...some text..
Conclusion: The significant increase of testing 
...
some text. 

02
Community based Index2- 
.some text.
.some text.
Conclusion: The significant increase of testing 
...
...<End of para> 
:
:

I am trying with the following pattern but it is not working

String patternStart = "\\d{2}[^\\d.,)][\\s:-]?[\\r\\n][A-Z]";
String patternEnd = "Conclusion.*(\\n.*)*"; // including the entire paragraph

I am using Pattern and Matcher as shown below, but no match is found.

 String regexString = Pattern.quote(patternStart)  + "(.*?)" + Pattern.quote(patternEnd);
 Pattern pattern = Pattern.compile(regexString);
 while (matcher.find()) {
            String textInBetween = matcher.group(1);
  }
  • Just a quick guess: use only one backslash with r and n. To match up to the end of the line, terminate the regex with $. Commented Jun 1, 2020 at 12:59

1 Answer


You could use a single pattern to extract the whole section:

^\d+(?:\R(?!\d+\R|Conclusion:).*)*\RConclusion:\h+(.*(?:\R(?!\d+\R|Conclusion:).*)*)

Explanation

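  • ^ Start of string (or start of each line when the multiline flag is enabled, Pattern.MULTILINE in Java, which is needed to match every section and not only the first)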
  • \d+ Match 1+ digits
  • (?: Non capture group
    • \R(?!\d+\R|Conclusion:).* Match a Unicode newline sequence and the rest of the line, provided the line does not start with either 1+ digits followed by a newline or Conclusion:
  • )* Close group and repeat 0+ times to match all the lines
  • \RConclusion:\h+ Match a newline and Conclusion: followed by 1+ horizontal whitespace chars
  • ( Capture group 1
    • .* Match the whole line
    • (?:\R(?!\d+\R|Conclusion:).*)* Repeat 0+ times matching all lines that do not start with either 1+ digits followed by a newline or Conclusion:
  • ) Close group 1

Regex demo

In Java

String regex = "^\\d+(?:\\R(?!\\d+\\R|Conclusion:).*)*\\RConclusion:\\h+(.*(?:\\R(?!\\d+\\R|Conclusion:).*)*)";
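
A minimal sketch of how the pattern could be applied to collect every section. It assumes the whole file content is already in a String, uses \h+ after Conclusion: as in the explanation, and compiles with Pattern.MULTILINE so that ^ can match at the start of each section. The class name SectionExtractor and the method extractSections are made up for illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SectionExtractor {
    // MULTILINE makes ^ match at the start of every line, not only at the start of the input.
    private static final Pattern SECTION = Pattern.compile(
        "^\\d+(?:\\R(?!\\d+\\R|Conclusion:).*)*\\RConclusion:\\h+(.*(?:\\R(?!\\d+\\R|Conclusion:).*)*)",
        Pattern.MULTILINE);

    // Returns one entry per section: [0] is the whole section, [1] is the text after "Conclusion:".
    static List<String[]> extractSections(String content) {
        List<String[]> sections = new ArrayList<>();
        Matcher matcher = SECTION.matcher(content);
        while (matcher.find()) {
            sections.add(new String[] { matcher.group(), matcher.group(1) });
        }
        return sections;
    }

    public static void main(String[] args) {
        // Sample input modeled on the file content in the question.
        String content = String.join("\n",
            "01",
            "Community based Index1- ",
            "...some text....",
            "Conclusion: The significant increase of testing ",
            "...",
            "some text. ",
            "",
            "02",
            "Community based Index2- ",
            ".some text.",
            "Conclusion: The significant increase of testing ",
            "...",
            "...<End of para> ");

        for (String[] section : extractSections(content)) {
            System.out.println("--- section ---");
            System.out.println(section[0]);
        }
    }
}
```

matcher.group() gives the full section including the leading number line, while matcher.group(1) gives only the conclusion paragraph.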

See a Java demo
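
As a side note on why the attempt in the question probably finds nothing: Pattern.quote turns its argument into a literal, so the regex syntax inside patternStart and patternEnd is matched character by character instead of being interpreted. In addition, (.*?) cannot cross line breaks unless Pattern.DOTALL is set. A small sketch of the Pattern.quote effect (the class QuoteDemo and the helper found are hypothetical names):

```java
import java.util.regex.Pattern;

class QuoteDemo {
    // True if the regex matches anywhere in the input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        String digits = "\\d{2}"; // the regex \d{2}

        // Interpreted as a regex: two digits are found.
        System.out.println(found(digits, "01"));                     // true

        // Quoted: the pattern now only matches the literal text \d{2}.
        System.out.println(found(Pattern.quote(digits), "01"));      // false
        System.out.println(found(Pattern.quote(digits), "\\d{2}"));  // true
    }
}
```

Pattern.quote is meant for user-supplied text that should be matched verbatim, so it should be dropped when the strings are themselves regular expressions.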


6 Comments

Thanks for your help. It is working for me partially. Actually there are some small variations to the content. Is there any other way that I can contact you to share some snippet of the file?
@NiranjanC You can add the content to the regex101 link and then on the left top you can either update or fork the regex and paste the updated link here in the comments and I can have a look at it. regex101.com/r/sIquEa/1
here is the updated link regex101.com/r/sIquEa/2 with masked data.
@NiranjanC Do you mean like this? regex101.com/r/g9JbzT/1
Sorry, yes that's the link