What regex should I use to extract multiple text blocks that are delimited by their headers, where the headers must also be parsed? For example:

some text info before message sequence
============
first message header that should be parsed (may contain = character)
============
first multiline
message body that
should also be parsed
(may contain = character)
============
second message header that should be parsed
============
second multiline
message body that
should also be parsed
... and so on

I was trying to use:

String regex = "^=+$\n"+
        "^(.+)$\n"+
        "^=+$\n"+
        "((?s:(?!(^=.+)).+))";
Pattern p = Pattern.compile(regex, Pattern.MULTILINE);

But ((?s:(?!(^=.+)).+)) consumes the second message as well. This test shows the problem:

import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.junit.Assert;
import org.junit.Test;
public class ParsingTest {
@Test
public void test() {
    String fstMsgHeader = "first message header that should be parsed (may contain = character)";
    String fstMsgBody = "first multiline\n"+
                        "message body that\n"+
                        "should also be parsed\n"+
                        "(may contain = character)";
    String sndMsgHeader = "second message header that should be parsed";
    String sndMsgBody = "second multiline\n"+
            "message body that\n"+
            "should also be parsed\n"+
            "... and so on";
    String sample = "some text info before message sequence\n"+
                    "============\n"+
                    fstMsgHeader+"\n"+
                    "============\n"+
                    fstMsgBody+"\n"+
                    "============\n"+
                    sndMsgHeader+"\n"+
                    "============\n"+
                    sndMsgBody +"\n";
    System.out.println(sample);
    String regex =  "^=+$\n"+
                    "^(.+)$\n"+
                    "^=+$\n"+
                    "((?s:(?!(^=.+)).+))";
    Pattern p = Pattern.compile(regex, Pattern.MULTILINE);
    Matcher matcher = p.matcher(sample);
    int blockNumber = 1;
    while (matcher.find()) {
        System.out.println("Block "+blockNumber+": "+matcher.group(0)+"\n_________________");
        if (blockNumber == 1) {
            Assert.assertEquals(fstMsgHeader, matcher.group(1));
            Assert.assertEquals(fstMsgBody, matcher.group(2));
        } else {
            Assert.assertEquals(sndMsgHeader, matcher.group(1));
            Assert.assertEquals(sndMsgBody, matcher.group(2));
        }
        blockNumber++; // advance the counter so later blocks take the else branch
    }
}

}

  • Why not use sample.split("============")? Commented Aug 20, 2013 at 15:03
  • What output do you expect, and what do you actually get? Commented Aug 20, 2013 at 15:12
  • Regarding split: I ended up with split, but capturing a message and its header with one regex seems to make the code clearer (one while loop with group accessors), so I am still considering this variant. Commented Aug 20, 2013 at 15:25
  • @sp00m: on each while iteration I want to extract a message header and its body. The first iteration successfully extracts the first message header (Assert.assertEquals(fstMsgHeader, matcher.group(1)); passes), but matcher.group(2) captures the first message body plus the rest of the string. Commented Aug 20, 2013 at 15:34

1 Answer

I am not sure if this is what you are looking for, but maybe this regex will help:

String regex = 
        "={12}\n" +   // twelve '=' marks and new line mark
        "(.+?)" +     // minimal match that has
        "\n={12}\n" + // new line mark with twelve '=' marks after it
        "(.+?)(?=\n={12}|$)"; // minimal match that will have new line
                              // character and twelve `=` marks after
                              // it or end of data $

To make it work, the dot must also match newline characters, which the Pattern.DOTALL flag enables:

Pattern p = Pattern.compile(regex, Pattern.DOTALL);
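
For reference, here is a minimal self-contained sketch of how this pattern might be applied in a single while loop, as the question asks. The class name BlockParser and the method parse are hypothetical names made up for illustration, and the sketch assumes every separator line is exactly twelve '=' characters and that lines end with \n.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of the original answer.
public class BlockParser {
    // Assumes separators are exactly twelve '=' characters followed by \n.
    private static final Pattern BLOCK = Pattern.compile(
            "={12}\n" +              // opening separator line
            "(.+?)" +                // group 1: header (reluctant)
            "\n={12}\n" +            // closing separator line
            "(.+?)(?=\n={12}|$)",    // group 2: body, up to next separator or end
            Pattern.DOTALL);         // let '.' match newlines too

    // Returns one {header, body} pair per block found in the input.
    public static List<String[]> parse(String input) {
        List<String[]> blocks = new ArrayList<>();
        Matcher m = BLOCK.matcher(input);
        while (m.find()) {
            blocks.add(new String[] { m.group(1), m.group(2) });
        }
        return blocks;
    }

    public static void main(String[] args) {
        String sample = "some text info before message sequence\n" +
                "============\n" +
                "first message header (may contain = character)\n" +
                "============\n" +
                "first multiline\nmessage body\n" +
                "============\n" +
                "second message header\n" +
                "============\n" +
                "second multiline\nmessage body\n";
        for (String[] block : parse(sample)) {
            System.out.println("Header: " + block[0]);
            System.out.println("Body:\n" + block[1]);
            System.out.println("_________________");
        }
    }
}
```

Because Pattern.MULTILINE is not set, $ here matches only at the end of the input (or just before a final line terminator), so the last body runs to the end of the data without the trailing newline.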

2 Comments

Pshemo, thank you, it is working. Can you describe what (.+?) means?
@MikhailTsaplin normally the + quantifier is greedy, so .+ tries to match as much as possible. Adding ? makes the + quantifier reluctant, so it tries to find the minimal match. More info at docs.oracle.com/javase/tutorial/essential/regex/quant.html.
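
The difference between the greedy and the reluctant form can be seen on a small input. This is only an illustrative sketch: the class name QuantifierDemo, the helper firstGroup, and the shortened "==" separator are made up for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class for illustration only.
public class QuantifierDemo {
    // Returns group 1 of the first match, or null if nothing matches.
    static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex, Pattern.DOTALL).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String input = "a\n==\nb\n==\nc";

        // Greedy: .+ takes as much as it can and backtracks to the LAST "\n==",
        // so group 1 is "a\n==\nb".
        System.out.println(firstGroup("(.+)\n==", input));

        // Reluctant: .+? takes as little as it can and stops at the FIRST "\n==",
        // so group 1 is "a".
        System.out.println(firstGroup("(.+?)\n==", input));
    }
}
```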
