1

I have a text file, and I need to split it into blocks using a regex in Java.

Each block starts with a number at the start of the line and the rest is indented by tabs.

For example:

1.  Here the block starts, and I need to capture
    all the text until the next block starts.
2.  The Second block.


3.  Another block.
        Some indented text.
4.  New block.

        More text.
            Still the 4th block.


    The end of the 4th block.

I tried a few patterns, but I can't figure out how to do it.

I was thinking about:

  1. a number at the start of a line

  2. some text

  3. a number at the start of a line

But this way the number at (3) is consumed by the current match, so it will not be included in the next match and the pattern will not catch the next block.

3 Answers

1

You can try this regex:

^\d.+?(?=^\d|\Z)

Remember to use the multiline and dot-all options:

Matcher m = Pattern.compile("^\\d.+?(?=^\\d|\\Z)", Pattern.MULTILINE | Pattern.DOTALL).matcher(text);
while (m.find()) {
     // m.group() is each of your blocks
}

Explanation:

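It first matches a digit at the start of a line (^\d), then lazily matches everything (.+?) until the lookahead (?=^\d|\Z) succeeds, that is, until the next position is either 1) the start of a line followed by a digit, or 2) the end of the string. Because a lookahead does not consume characters, the digit that ends one block is still available to start the next match.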
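To make the role of the two flags concrete, here is a minimal self-contained sketch. The class name LazyBlockDemo, the helper split and the sample text are made up for illustration; only the pattern itself comes from this answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the original answer.
class LazyBlockDemo {

    // Collects every match of the answer's pattern, compiled with the given flags.
    static List<String> split(String text, int flags) {
        Pattern block = Pattern.compile("^\\d.+?(?=^\\d|\\Z)", flags);
        List<String> blocks = new ArrayList<>();
        Matcher m = block.matcher(text);
        while (m.find()) {
            blocks.add(m.group());
        }
        return blocks;
    }

    public static void main(String[] args) {
        // Assumed sample: tab-indented continuation lines, as described in the question.
        String text = "1.\tFirst block,\n\tcontinued.\n2.\tSecond block.\n\n3.\tThird block.\n";

        // MULTILINE: ^ matches at each line start. DOTALL: . also matches line breaks.
        List<String> both = split(text, Pattern.MULTILINE | Pattern.DOTALL);
        System.out.println("MULTILINE | DOTALL: " + both.size() + " blocks");

        // With DOTALL alone, ^ only matches at the start of the input.
        List<String> dotAllOnly = split(text, Pattern.DOTALL);
        System.out.println("DOTALL only: " + dotAllOnly.size() + " blocks");
    }
}
```

With this sample, both flags together give three blocks, while DOTALL alone gives a single match covering the whole text, which is why the MULTILINE flag matters here.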


2 Comments

We don't need multiline mode here, dot all is enough.
It works great with both multiline and dot-all flags, thanks for the explanation!
1

You can match 1+ digits and a dot at the start of a line, followed by the rest of that line (any char except a newline, 0+ times).

Then repeat matching all following lines that do not start with 1+ digits followed by a dot:

^\d+\..*(?:\r?\n(?!\d+\.).*)*

Explanation

  • ^ Start of a line (requires Pattern.MULTILINE in Java; without it, ^ only matches at the very start of the input)
  • \d+\..* Match 1+ digits followed by a dot and 0+ chars except a newline
  • (?: Non capturing group
    • \r?\n Match a newline
    • (?!\d+\.) Assert that what is directly on the right is not 1+ digits followed by a dot
    • .* Match any char except a newline 0+ times
  • )* Close non capturing group and repeat 0+ times
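A minimal sketch of how this pattern could be used from Java. The class name LineWiseBlockDemo, the helper split and the sample text are assumptions for illustration. The pattern is compiled with Pattern.MULTILINE so that ^ matches at every line start, and DOTALL is not needed because line breaks are matched explicitly by \r?\n.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the original answer.
class LineWiseBlockDemo {

    // First line: digits, a dot, rest of the line.
    // Then: any following lines that do not start with digits and a dot.
    private static final Pattern BLOCK =
            Pattern.compile("^\\d+\\..*(?:\\r?\\n(?!\\d+\\.).*)*", Pattern.MULTILINE);

    static List<String> split(String text) {
        List<String> blocks = new ArrayList<>();
        Matcher m = BLOCK.matcher(text);
        while (m.find()) {
            blocks.add(m.group());
        }
        return blocks;
    }

    public static void main(String[] args) {
        // Assumed sample: tab-indented continuation lines, as described in the question.
        String text = "1.\tFirst block,\n\tcontinued.\n2.\tSecond block.\n\n3.\tThird block.\n";
        for (String block : split(text)) {
            System.out.println("BLOCK: [" + block.trim() + "]");
        }
    }
}
```

Note that the line break right before the next numbered line is not part of the match, so a block may or may not end with a newline; trimming each block, as main does here, hides that difference.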


2 Comments

Thanks for the reply! It also works, but can you explain the second part (?:\r?\n(?!\d+\.).*) more carefully?
@NDV I have added an explanation.
0

Try searching for the following pattern:

\d+\.\t(.*?)(?=\d+\.\t|$)

Here is a sample script:

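import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

List<String> blocks = new ArrayList<>();
String input = "1.\tsome content\n\tblah\n2.\tsome more content";
// Capture everything after "<number>.<tab>" up to the next "<number>.<tab>" or the end of the input.
// The \t inside the Java string literal is a literal tab character in the pattern.
String pattern = "\\d+\\.\t(.*?)(?=\\d+\\.\t|$)";
Pattern r = Pattern.compile(pattern, Pattern.DOTALL);
Matcher m = r.matcher(input);
while (m.find()) {
    blocks.add(m.group(1));
    System.out.println("LINE: " + m.group(1));
}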

LINE: some content
      blah

LINE: some more content

Note that we perform the regex search in DOTALL mode, because a given block may span multiple lines.
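One possible variation, which is not the pattern from this answer: anchor the number to the start of a line with MULTILINE, and accept spaces as well as tabs after the dot. This is a sketch under the assumption that the real file may use either kind of whitespace. The class name AnchoredContentDemo and the helper contents are made up for illustration. It uses \z (absolute end of input) instead of $, because in MULTILINE mode $ would match at the end of every line.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; a variation on the answer's pattern, not the original.
class AnchoredContentDemo {

    // Group 1 is the block text without its leading "<number>." marker.
    private static final Pattern BLOCK = Pattern.compile(
            "^\\d+\\.[ \\t]+(.*?)(?=^\\d+\\.[ \\t]|\\z)",
            Pattern.MULTILINE | Pattern.DOTALL);

    static List<String> contents(String text) {
        List<String> blocks = new ArrayList<>();
        Matcher m = BLOCK.matcher(text);
        while (m.find()) {
            blocks.add(m.group(1));
        }
        return blocks;
    }

    public static void main(String[] args) {
        // Assumed sample that uses spaces instead of tabs after the number.
        String text = "1.  Here the block starts,\n    all the text.\n"
                + "2.  The Second block.\n\n"
                + "3.  Another block.\n";
        for (String block : contents(text)) {
            System.out.println("BLOCK: [" + block.trim() + "]");
        }
    }
}
```

Anchoring with ^ also means that text such as "see item 2." in the middle of a line does not start a new block.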

2 Comments

Can you test your code against OP's given input? It doesn't seem to match anything.
No, I mean the string that the OP gave in the question.
