Java Regex building

Question

I need help building a regex that collects every string matching a particular pattern from the input.

Sample Input String:

*!
hostname ${hostname} !
!
!
ip name-server ${ip-name-server}
no ipv6 cef
!
!
voice class codec 1 
 codec preference 1 ${codec-pref-1}  codec preference 2 ${codec-pref-2}      codec preference 3 ${codec-pref-3} !
!
session target dns:${session-targ-DNS}  dtmf-relay rtp-nte*

The output should be: hostname, ip-name-server, codec-pref-1, codec-pref-2, codec-pref-3, session-targ-DNS

i.e. every string enclosed in the format ${string} should be collected and returned.

I tried the code below:

public void fetchKeyword(String inputString) {  
        String inputString1 = inputString.replace("\n", " ");   
        Pattern p = Pattern.compile("\\${$1} ");
        Matcher m = p.matcher(inputString1);
        int i=0;
        while(m.find()){
            System.out.println(m.group(i));
            i++;
        }
    }

I also tried patterns like .${.*}, (.)${.*?} etc., but none gave the expected result. I got exceptions like the one below:

  Exception in thread "main" java.util.regex.PatternSyntaxException: Illegal repetition near index 1
\${$1} 
 ^
    at java.util.regex.Pattern.error(Unknown Source)
    at java.util.regex.Pattern.closure(Unknown Source)
    at java.util.regex.Pattern.sequence(Unknown Source)
    at java.util.regex.Pattern.expr(Unknown Source)
    at java.util.regex.Pattern.compile(Unknown Source)
    at java.util.regex.Pattern.<init>(Unknown Source)
    at java.util.regex.Pattern.compile(Unknown Source)
    at myUtil.ReplaceString.fetchKeyword(ReplaceString.java:70)
    at myUtil.ReplaceString.main(ReplaceString.java:20)

Can anyone please help with this?

Note that { and } are special in regular expressions: they let you specify bounded repetitions, such as x{3,5} (x repeated 3 to 5 times). So you need to escape them as well, not just the $.
– RealSkeptic, Commented Sep 21, 2016 at 9:18
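
A minimal sketch of what this comment describes, assuming nothing beyond java.util.regex: the pattern from the question is rejected at compile time, while a pattern that also escapes the braces compiles. The class name EscapeDemo and the helper compiles are hypothetical names used only for illustration.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class EscapeDemo {

    // Hypothetical helper: true if the regex compiles, false on a syntax error.
    public static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // The pattern from the question: $ is escaped, but { is still read
        // as the start of a repetition such as x{3,5}.
        System.out.println(compiles("\\${$1} "));        // false

        // Escaping the braces as well makes them literal characters.
        System.out.println(compiles("\\$\\{(.+?)\\}"));  // true

        // Pattern.quote is an alternative when a whole token is literal.
        Pattern literal = Pattern.compile(Pattern.quote("${"));
        System.out.println(literal.matcher("a ${b}").find()); // true
    }
}
```

Pattern.quote is handy for fixed delimiters, but it cannot be used around the part of the expression that needs a capturing group.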

Community · Accepted Answer · 2017-05-23 12:15:06Z

You can use this solution to retrieve the placeholder text:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// test string
String input = "! hostname ${hostname} ! ! ! ip name-server "
            + "${ip-name-server} no ipv6 cef ! ! "
            + "voice class codec 1 codec preference 1 ${codec-pref-1} "
            + "codec preference 2 ${codec-pref-2} codec preference 3 "
            + "${codec-pref-3} ! ! session target "
            + "dns:${session-targ-DNS} dtmf-relay rtp-nte";

// compiling pattern with one group representing the text inside ${}
Pattern p = Pattern.compile("\\$\\{(.+?)\\}");
// initializing matcher
Matcher m = p.matcher(input);
// iterating find
while (m.find()) {
    // reading capturing group 1 of the current match
    System.out.println(m.group(1));
}

Output

hostname
ip-name-server
codec-pref-1
codec-pref-2
codec-pref-3
session-targ-DNS

Notes

The $1 idiom you used belongs in replacement strings (e.g. String#replaceAll), where it back-references an indexed group.
Groups are declared in the pattern with parentheses, (X), or, since Java 7, as named groups: (?<name>X).
The index of a group is defined by the position of its grouping idiom within the pattern, not by the iteration over matches, as you seem to assume.
See the java.util.regex.Pattern documentation.
The example pattern escapes the $, { and } characters; the backslashes are doubled because the pattern is written as a Java string literal.
Also worth noting: it uses a reluctant quantifier (+?) in order to match as little as possible, stopping at the next known character: }
Finally, as stated above, group #1 is defined by the parentheses and captures one or more characters of any kind (up to the closing }).
Line breaks in your input text will not affect this pattern's results as long as no line break occurs within a ${something} idiom.
If such a case occurred, you would either need to strip the line breaks from the text before parsing, or compile your pattern with Pattern.DOTALL and clean up the line breaks in the matches afterwards (the latter doesn't look like a great solution though).
As Thomas mentions, this pattern assumes the expression between {} will never be empty. If you do have an empty expression, the pattern will go wrong by matching everything from the start of the empty expression to the end of the next non-empty one, if there is one. So either you are guaranteed not to have empty expressions, or you should use .*? instead of .+? (see also Thomas' answer).
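
A few of the notes above can be sketched in code. This is only an illustration under stated assumptions: the class name GroupNotesDemo, its method names, and the group name key are hypothetical; the regex itself is the one from this answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupNotesDemo {

    // $1 in a replacement string refers to capturing group 1 of the pattern.
    public static String stripMarkers(String input) {
        return input.replaceAll("\\$\\{(.+?)\\}", "$1");
    }

    // The same pattern with a named group (available since Java 7).
    public static String firstKey(String input) {
        Matcher m = Pattern.compile("\\$\\{(?<key>.+?)\\}").matcher(input);
        return m.find() ? m.group("key") : null;
    }

    // Reports whether the pattern finds a placeholder with the given flags.
    public static boolean finds(String input, int flags) {
        return Pattern.compile("\\$\\{(.+?)\\}", flags).matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(stripMarkers("hostname ${hostname} !")); // hostname hostname !
        System.out.println(firstKey("dns:${session-targ-DNS} x"));  // session-targ-DNS

        // A line break inside the braces: "." does not match it by default.
        String broken = "${host\nname}";
        System.out.println(finds(broken, 0));              // false
        System.out.println(finds(broken, Pattern.DOTALL)); // true
    }
}
```

With Pattern.DOTALL the captured text still contains the line break, which is why the note above suggests cleaning the input first.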

Very nice breakdown. One hint on the reluctant quantifier though: if the data contained an empty tag, e.g. ${}...${b}, the expression would match everything from the start of the first tag to the end of the next non-empty tag, so the group would capture }...${b in the example above.

Thomas · Accepted Answer · 2016-09-21 11:12:38Z

m.group(i) is not correct. Group indexes are the same for every match and are determined by the regex. Since you don't have any capturing groups, you can only use index 0, which means the entire match.

Also, the back-reference $1 can be used in a replacement string but not in the regex, and its number is also based on the capturing groups, i.e. $1 would mean group index 1 (which you don't have).

Thus your expression very likely should look like this: \$\{([^}]*)\}
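
As a sketch of how that expression and the group indexes fit together (the class name PlaceholderCollector and the method collect are hypothetical, and the backslashes are doubled because the regex is written as a Java string literal):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PlaceholderCollector {

    // The expression from this answer, with one capturing group.
    private static final Pattern PLACEHOLDER = Pattern.compile("\\$\\{([^}]*)\\}");

    // Collects the text of the given group for every match in the input.
    public static List<String> collect(String input, int group) {
        List<String> result = new ArrayList<>();
        Matcher m = PLACEHOLDER.matcher(input);
        while (m.find()) {
            // The group index stays the same for every match.
            result.add(m.group(group));
        }
        return result;
    }

    public static void main(String[] args) {
        String input = "hostname ${hostname} !\nip name-server ${ip-name-server}";
        // Group 0 is the entire match, markers included.
        System.out.println(collect(input, 0)); // [${hostname}, ${ip-name-server}]
        // Group 1 is only the text inside ${...}.
        System.out.println(collect(input, 1)); // [hostname, ip-name-server]
    }
}
```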

Edit: Note that I used the "any"-quantifier (*) here in order to catch empty tags, i.e. ${}. It is very likely that those represent some kind of error and thus you'll probably want to catch and handle them. If you don't want to do that, i.e. skip those, just use the "at least one"-quantifier (+).

I also used an explicit negated character class ([^}], i.e. everything that's not a right curly brace) instead of a reluctant quantifier like .*? for a simple reason: it's more explicit and thus more readable (in my opinion) and less error-prone.

As an example, take the possibility of empty tags in the data and say you want to ignore them. Using \$\{[^}]+\} would ignore them, while \$\{.+?\} would not: the shortest possible match in ${}...${b} spans the whole string, with .+? matching }...${b (the regex engine tries to match from left to right).
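
The difference can be checked with a short sketch. The class name EmptyTagDemo and the method collect are hypothetical, and a capturing group has been added to each expression so the captured text can be printed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EmptyTagDemo {

    // Collects group 1 of every match of the given regex in the input.
    public static List<String> collect(String regex, String input) {
        List<String> result = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        String input = "${}...${b}";

        // Reluctant quantifier: needs at least one character, so it runs
        // past the first } and stops at the second one.
        System.out.println(collect("\\$\\{(.+?)\\}", input));   // [}...${b]

        // Negated class with +: the empty tag is skipped.
        System.out.println(collect("\\$\\{([^}]+)\\}", input)); // [b]

        // Negated class with *: the empty tag is reported as an empty string.
        System.out.println(collect("\\$\\{([^}]*)\\}", input)); // [, b]
    }
}
```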

It's also potentially safer when it comes to catastrophic backtracking (e.g. if you added another quantifier). In the simple case you provided that might not be a problem, but keep in mind that things like (.+)* might kill your regex engine.

edited Sep 21, 2016 at 11:12

answered Sep 21, 2016 at 9:16


5 Comments

Manushi Over a year ago

Thank you very much. Unfortunately, I am able to tick only one answer. This pattern works. Thanks for the explanation as well.

Simon PA Over a year ago

You can upvote it, then it is also marked as "valuable"

Mena Over a year ago

Plus one from me. Although the quantifier in this one matches zero or more characters, which might accept empty ${}s.

Thomas Over a year ago

@Mena yes that was deliberate so that they are not skipped but could be handled (an empty tag is likely to be a problem/error in most cases so you'd want to catch them too).

Manushi Over a year ago

Yes. The expression \$\{([^}]*)\} works for ${} as well. Generally, the requirements I work with will not have empty values. Anyway, handling them in case they occur will surely help.
