How to match nested repeating groups with regex in Java?

Question

I'm trying to match repeating groups with Java:

String s = "The very first line\n"
        + "\n"
        + "AA (aa)\n"
        + "BB (bb)\n"
        + "CC (cc)\n"
        + "\n";

Pattern p = Pattern.compile(
        "The very first line\\s+"
        + "((?<gr1>[a-z]+)\\s+\\((?<gr2>[^)]+)\\)\\s*)+",
        Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

Matcher m = p.matcher(s);

if (m.find()) {
    for (int i = 0; i <= m.groupCount(); i++) {
        System.out.println("group #" + i + ": [" + m.group(i).trim() + "]");
    }
    System.out.println("group gr1: [" + m.group("gr1").trim() + "]");
    System.out.println("group gr2: [" + m.group("gr2").trim() + "]");
}

The problem is with the repeating groups: although the regex matches the whole text block (see group #0 in the output below), retrieving groups #2 and #3 (or the named groups gr1/gr2) returns only the last match (CC/cc) and skips the previous ones (AA/aa and BB/bb):

group #0: [The very first line

AA (aa)
BB (bb)
CC (cc)]
group #1: [CC (cc)]
group #2: [CC]
group #3: [cc]
group gr1: [CC]
group gr2: [cc]

Is there a way to solve this?

Edit: "The very first line" is in the pattern as an identification string; see my comment on gknicker's answer below.

gknicker · Accepted Answer · 2015-02-07 22:41:46Z


It seems like you wanted your pattern to match not the whole input string, but just the individual repeating sections. If that's true, your pattern would be:

    Pattern p = Pattern.compile(
        "((?<gr1>[a-z]+)\\s+\\((?<gr2>[^)]+)\\))",
        Pattern.CASE_INSENSITIVE);

Then in this case you would have a while loop to find each match:

    Matcher m = p.matcher(s);

    while (m.find()) {
        System.out.println("group gr1: ["
            + m.group("gr1").trim() + "]");
        System.out.println("group gr2: ["
            + m.group("gr2").trim() + "]");
    }
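For context on why the loop is needed: in Java, a capturing group that sits inside a quantifier holds only the text of its most recent iteration, which is why the original pattern reports just CC/cc. Here is a minimal sketch of that behaviour; the class and method names are made up for illustration and the input is simplified to bare words.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the question's code.
final class RepeatedGroupDemo {

    // One match over the whole input: group 1 is overwritten on every
    // iteration of the outer (?:...)+, so only the last word survives.
    static String lastIteration(String input) {
        Matcher m = Pattern.compile("(?:([a-z]+)\\s*)+").matcher(input);
        return m.matches() ? m.group(1) : null;
    }

    // One find() per word: every occurrence is visited separately.
    static List<String> eachIteration(String input) {
        List<String> words = new ArrayList<>();
        Matcher m = Pattern.compile("[a-z]+").matcher(input);
        while (m.find()) {
            words.add(m.group());
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(lastIteration("aa bb cc"));
        System.out.println(eachIteration("aa bb cc"));
    }
}
```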

But if you need the whole match, you'll probably have to use two patterns like this:

    String s = "The very first line\n"
        + "\n"
        + "AA (aa)\n"
        + "BB (bb)\n"
        + "CC (cc)\n"
        + "\n";

    Pattern p = Pattern.compile(
        "The very first line\\s+(([a-z]+)\\s+\\(([^)]+)\\)\\s*)+",
        Pattern.CASE_INSENSITIVE);

    Pattern p2 = Pattern.compile(
        "((?<gr1>[a-z]+)\\s+\\((?<gr2>[^)]+)\\))",
        Pattern.CASE_INSENSITIVE);

    Matcher m = p.matcher(s);
    while (m.find()) {
        Matcher m2 = p2.matcher(m.group());
        while (m2.find()) {
            System.out.println("group gr1: ["
                + m2.group("gr1").trim() + "]");
            System.out.println("group gr2: ["
                + m2.group("gr2").trim() + "]");
        }
    }
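If you would rather keep a single pattern, one possible alternative (a different technique from the two patterns above) is to anchor each pair with `\G`, which matches where the previous match ended. This is only a sketch under the assumption that the pairs follow the header directly, separated by nothing but whitespace; the class name `ContiguousPairs` and the method `extract` are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original answer.
final class ContiguousPairs {

    // A pair must be preceded either by the header (first pair) or by the
    // end of the previous match (\G, later pairs). The (?!^) lookahead keeps
    // \G from matching at the very start of the input, where no header
    // has been seen yet.
    private static final Pattern PAIR = Pattern.compile(
        "(?:The very first line\\s+|\\G(?!^))"
        + "(?<gr1>[a-z]+)\\s+\\((?<gr2>[^)]+)\\)\\s*",
        Pattern.CASE_INSENSITIVE);

    // Returns one {gr1, gr2} array per pair found after the header.
    static List<String[]> extract(String text) {
        List<String[]> pairs = new ArrayList<>();
        Matcher m = PAIR.matcher(text);
        while (m.find()) {
            pairs.add(new String[] { m.group("gr1"), m.group("gr2") });
        }
        return pairs;
    }

    public static void main(String[] args) {
        String s = "The very first line\n\nAA (aa)\nBB (bb)\nCC (cc)\n\n";
        for (String[] pair : extract(s)) {
            System.out.println("gr1: [" + pair[0] + "], gr2: [" + pair[1] + "]");
        }
    }
}
```

The chain of matches stops at the first position where a pair does not follow immediately, so pairs under some other heading should be skipped. Whether that is strict enough depends on what else can appear in your real input.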

edited Feb 7, 2015 at 22:41

answered Feb 7, 2015 at 20:49

gknicker


2 Comments

Nobwyn Over a year ago

Thanks for the suggestion, but I use "The very first line" as an orientation string: the text I am trying to match contains multiple sections, each with repeating groups that follow the same pattern. If I matched only gr1 and gr2, I would first need to extract the block I want; otherwise I would get incorrect results.

gknicker Over a year ago

See my edit for an alternate solution with two patterns.
