1

I'm trying to match repeating groups with Java:

String s = "The very first line\n"
        + "\n"
        + "AA (aa)\n"
        + "BB (bb)\n"
        + "CC (cc)\n"
        + "\n";

Pattern p = Pattern.compile(
        "The very first line\\s+"
        + "((?<gr1>[a-z]+)\\s+\\((?<gr2>[^)]+)\\)\\s*)+",
        Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

Matcher m = p.matcher(s);

if (m.find()) {
    for (int i = 0; i <= m.groupCount(); i++) {
        System.out.println("group #" + i + ": [" + m.group(i).trim() + "]");
    }
    System.out.println("group gr1: [" + m.group("gr1").trim() + "]");
    System.out.println("group gr2: [" + m.group("gr2").trim() + "]");
}

The problem is with the repeating groups: though the regex matches the whole text block (see group #0 in output example below), when retrieving groups #2 and #3 (or by name as well - gr1/gr2) it does return only the last match (CC/cc) and skips the previous ones (AA/aa and BB/bb)

group #0: [The very first line

AA (aa)
BB (bb)
CC (cc)]
group #1: [CC (cc)]
group #2: [CC]
group #3: [cc]
group gr1: [CC]
group gr2: [cc]

Is there a way to solve this?

edit: The very first line is in the pattern as identification string - see the comment to the gknicker's answer below

0

1 Answer 1

1

It seems like you wanted your pattern to match not the whole input string, but just the individual repeating sections. If that's true, your pattern would be:

    Pattern p = Pattern.compile(
        "((?<gr1>[a-z]+)\\s+\\((?<gr2>[^)]+)\\))",
        Pattern.CASE_INSENSITIVE);

Then in this case you would have a while loop to find each match:

    Matcher m = p.matcher(s);

    while (m.find()) {
        System.out.println("group gr1: ["
            + m.group("gr1").trim() + "]");
        System.out.println("group gr2: ["
            + m.group("gr2").trim() + "]");
    }

But if you need the whole match, you'll probably have to use two patterns like this:

    String s = "The very first line\n"
        + "\n"
        + "AA (aa)\n"
        + "BB (bb)\n"
        + "CC (cc)\n"
        + "\n";

    Pattern p = Pattern.compile(
        "The very first line\\s+(([a-z]+)\\s+\\(([^)]+)\\)\\s*)+",
        Pattern.CASE_INSENSITIVE);

    Pattern p2 = Pattern.compile(
        "((?<gr1>[a-z]+)\\s+\\((?<gr2>[^)]+)\\))",
        Pattern.CASE_INSENSITIVE);

    Matcher m = p.matcher(s);
    while (m.find()) {
        Matcher m2 = p2.matcher(m.group());
        while (m2.find()) {
            System.out.println("group gr1: ["
                + m2.group("gr1").trim() + "]");
            System.out.println("group gr2: ["
                + m2.group("gr2").trim() + "]");
        }
    }
Sign up to request clarification or add additional context in comments.

2 Comments

Thanks for the suggestion, but I use the "The very first line" as orientation string - the text I am trying to match contains multiple sections, having such repeating groups (corresponding to the same pattern). If I would "just" match gr1 and gr2, then I would need to extract the block I need before trying to match it, otherwise I would get incorrect results
See my edit for an alternate solution with two patterns.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.