
I need to create a regular expression able to find and get valid identifiers in Java code like this:

int a, b, c;
float d, e;
a = b = 5;
c = 6;
if ( a > b)
{
c = a - b;
e = d - 2.0;
}
else
{
d = e + 6.0;
b = a + c;
}

I have tried to combine multiple regexes into a single one, but how can I build a pattern that excludes reserved words?

I tried this regex ^(((&&|<=|>=|<|>|!=|==|&|!)|([-+=]{1,2})|([.!?)}{;,(-]))|(else|if|float|int)|(\d[\d.])) but it does not work as expected.


In the following picture, how should I match identifiers?

[image]

1 Answer

A valid Java identifier follows these rules:

  1. it has at least one character
  2. the first character MUST be a letter [a-zA-Z], underscore _, or dollar sign $
  3. the rest of the characters MAY be letters, digits, underscores, or dollar signs
  4. reserved words MUST NOT be used as identifiers
  5. Update: a single underscore _ is a keyword since Java 9, so it MUST NOT be used as an identifier either
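For reference, the five rules can also be checked by hand, without any regex. This is only a minimal sketch: the class name, the helper names and the shortened keyword set are made up for illustration, and only ASCII letters are considered, as in the list above.

```java
import java.util.Set;

class IdentifierRules {
    // Hypothetical, shortened keyword set for illustration; "_" covers rule 5.
    static final Set<String> SAMPLE_KEYWORDS = Set.of("if", "else", "for", "float", "int", "_");

    // Rule 2: ASCII letter, underscore or dollar sign.
    static boolean isStart(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c == '_' || c == '$';
    }

    // Rule 3: anything allowed at the start, plus ASCII digits.
    static boolean isPart(char c) {
        return isStart(c) || (c >= '0' && c <= '9');
    }

    static boolean isValidIdentifier(String s) {
        if (s.isEmpty() || !isStart(s.charAt(0))) {
            return false;                          // rules 1 and 2
        }
        for (int i = 1; i < s.length(); i++) {
            if (!isPart(s.charAt(i))) {
                return false;                      // rule 3
            }
        }
        return !SAMPLE_KEYWORDS.contains(s);       // rules 4 and 5
    }
}
```

Because the whole token is compared against the keyword set, a name such as integer passes while int does not.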

A naive regexp to validate the first three conditions would be as follows: (\b([A-Za-z_$][$\w]*)\b) but it does not filter out the reserved words.
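To see what the naive pattern actually returns, here is a small sketch that runs it over one line of the code from the question. The class and method names are invented for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NaiveIdentifierScan {
    // The naive pattern from the text: letter/_/$ first, then letters, digits, _ or $.
    static final Pattern NAIVE = Pattern.compile("\\b([A-Za-z_$][$\\w]*)\\b");

    // Collects every token that has the shape of an identifier.
    static List<String> scan(String code) {
        List<String> found = new ArrayList<>();
        Matcher m = NAIVE.matcher(code);
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        // The reserved word "int" is reported together with a, b and c.
        System.out.println(scan("int a, b, c;"));
    }
}
```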

To exclude the reserved words, a negative lookahead (?!...) is needed; it lists the tokens that must not appear at the current position: \b(?!(_\b|if|else|for|float|int))([A-Za-z_$][$\w]*):

  • The negative lookahead (?!(_\b|if|else|for|float|int)) excludes the listed words; its inner parentheses are capturing group #1
  • Group #2: ([A-Za-z_$][$\w]*) captures the identifier.
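A quick way to check the lookahead is to run it against a few statements. The sketch below uses the pattern exactly as written above with the same short word list; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaheadScan {
    // Pattern from the text: word boundary, negative lookahead, identifier.
    static final Pattern FILTERED = Pattern.compile(
            "\\b(?!(_\\b|if|else|for|float|int))([A-Za-z_$][$\\w]*)");

    static List<String> scan(String code) {
        List<String> found = new ArrayList<>();
        Matcher m = FILTERED.matcher(code);
        while (m.find()) {
            found.add(m.group(2)); // group #2 holds the identifier
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(scan("int a; float d; if (a > d) {}"));
        // Caveat: the lookahead only inspects the start of a token,
        // so "integer" is rejected because it begins with "int".
        System.out.println(scan("int integer = 1;"));
    }
}
```

Note the caveat in the second call: because the alternatives are not followed by any boundary check, identifiers that merely begin with a listed word (integer, format) are dropped as well.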

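However, the word boundary \b only matches between a word character (\w) and a non-word character, and the dollar sign $ is not a word character. There is therefore no boundary between whitespace and a leading $, so this regular expression fails to match identifiers starting with $.
Also, we may want to exclude matches inside string and character literals ("not_a_variable", 'c', '\u65').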
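The effect of \b on a leading $ is easy to reproduce in isolation. This is a small sketch with invented names, using only the boundary and the identifier part of the pattern:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BoundaryDollarDemo {
    // Identifier shape anchored with a leading word boundary only.
    static final Pattern WITH_BOUNDARY = Pattern.compile("\\b[A-Za-z_$][$\\w]*");

    static List<String> scan(String code) {
        List<String> found = new ArrayList<>();
        Matcher m = WITH_BOUNDARY.matcher(code);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // "$1" is skipped: there is no boundary between the space and '$'.
        System.out.println(scan("static int $1;"));
        // "$y" is truncated to "y": the first boundary sits between '$' and 'y'.
        System.out.println(scan("x = $y;"));
    }
}
```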

This can be done by replacing the word boundary \b with a positive lookbehind (?<=...), which requires a specific character before the main expression without including it in the result: (?<=[^$\w'"\\])(?!(_\b|if|else|for|float|int))([A-Za-z_$][$\w]*)
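Here is a short sketch of the lookbehind variant in action, again with the short word list and hypothetical class and method names. It also shows one property worth knowing: a positive lookbehind needs a character to look at, so an identifier at the very start of the input is not reported.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookbehindScan {
    // Lookbehind: the previous character must not be $, a word character,
    // a quote or a backslash. Then the same lookahead and identifier as before.
    static final Pattern PATTERN = Pattern.compile(
            "(?<=[^$\\w'\"\\\\])(?!(_\\b|if|else|for|float|int))([A-Za-z_$][$\\w]*)");

    static List<String> scan(String code) {
        List<String> found = new ArrayList<>();
        Matcher m = PATTERN.matcher(code);
        while (m.find()) {
            found.add(m.group(2));
        }
        return found;
    }

    public static void main(String[] args) {
        // $1 is found now; 'c' and "text" are skipped because a quote precedes them.
        System.out.println(scan("float $1 = 2; int _c0 = 'c'; s = \"text\";"));
        // No character precedes 'a' here, so the lookbehind cannot succeed.
        System.out.println(scan("a = 1;"));
    }
}
```

The quote check only looks at the single preceding character, so it protects one-word literals like the ones above; a second word inside a longer string literal is still preceded by a space and would be reported.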


Next, the full list of Java reserved words (including the literals true, false, and null) can be collected into a single String of tokens separated with |.

A test class showing the final pattern for regular expression and its usage to detect the Java identifiers is provided below.

import java.util.Arrays;
import java.util.List;
import java.util.regex.MatchResult;
import java.util.regex.Pattern;

public class IdFinder {

    static final List<String> RESERVED = Arrays.asList(
        "abstract", "assert", "boolean", "break", "byte", "case", "catch", "char", "class", "const",
        "continue", "default", "double", "do", "else", "enum", "extends", "false", "final", "finally",
        "float", "for", "goto", "if", "implements", "import", "instanceof", "int", "interface", "long",
        "native", "new", "null", "package", "private", "protected", "public", "return", "short", "static",
        "strictfp", "super", "switch", "synchronized", "this", "throw", "throws", "transient", "true", "try",
        "void", "volatile", "while", "_\\b"
    );

    static final String JAVA_KEYWORDS = String.join("|", RESERVED);

    static final Pattern VALID_IDENTIFIERS = Pattern.compile(
            "(?<=[^$\\w'\"\\\\])(?!(" + JAVA_KEYWORDS + "))([A-Za-z_$][$\\w]*)");

    public static void main(String[] args) {
        System.out.println("ID pattern:\n" + VALID_IDENTIFIERS.pattern());

        String code = "public class Main {\n\tstatic int $1;\n\tprotected char _c0 = '\\u65';\n\tprivate long c1__$$;\n}";

        System.out.println("\nIdentifiers in the following code:\n=====\n" + code + "\n=====");

        // Matcher.results() is available since Java 9 and streams every match.
        VALID_IDENTIFIERS.matcher(code).results()
                         .map(MatchResult::group)
                         .forEach(System.out::println);
    }
}

Output

ID pattern:
(?<=[^$\w'"\\])(?!(abstract|assert|boolean|break|byte|case|catch|char|class|const|continue|default|double|do|else|enum|extends|false|final|finally|float|for|goto|if|implements|import|instanceof|int|interface|long|native|new|null|package|private|protected|public|return|short|static|strictfp|super|switch|synchronized|this|throw|throws|transient|true|try|void|volatile|while|_\b))([A-Za-z_$][$\w]*)

Identifiers in the following code:
=====
public class Main {
    static int $1;
    protected char _c0 = '\u65';
    private long c1__$$;
}
=====
Main
$1
_c0
c1__$$
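One caveat of a lookahead built from bare alternatives is that it also rejects identifiers that merely begin with a reserved word, such as integer, done or format. A possible refinement, shown here only as a sketch and not as part of the pattern above, is to require that the reserved word is not followed by another identifier character. The sketch also swaps the positive lookbehind for a negative one so that an identifier at the very start of the input can match. The class name and the shortened keyword list are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GuardedIdentifierScan {
    // Hypothetical, shortened keyword list; the full list would be used the same way.
    static final String SAMPLE_KEYWORDS = String.join("|",
            "do", "double", "else", "float", "for", "if", "int", "_");

    // (?<![$\w'"\\])      no identifier character, quote or backslash before
    // (?!(?:...)(?![$\w])) not a keyword that ends right here
    static final Pattern GUARDED = Pattern.compile(
            "(?<![$\\w'\"\\\\])(?!(?:" + SAMPLE_KEYWORDS + ")(?![$\\w]))[A-Za-z_$][$\\w]*");

    static List<String> scan(String code) {
        List<String> found = new ArrayList<>();
        Matcher m = GUARDED.matcher(code);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // "int" is rejected, but integer, done and format are kept.
        System.out.println(scan("int integer = done + format;"));
        // A lone "_" is rejected, while _x and _$ are kept.
        System.out.println(scan("_ = 1; _x = 2; _$ = 3;"));
    }
}
```

The trailing (?![$\w]) plays the role of a word boundary that also treats $ as part of an identifier, which is why the special case _\b is no longer needed in the list.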

2 Comments

I thought Java allowed Unicode characters as part of identifiers, i.e. `Jävä` would be a valid identifier?
@knittl yes, however the question was originally about detecting ordinary ASCII-based identifiers. Also, mixing non-ASCII letters into identifiers does not seem like a good idea, because the code would turn into a nightmare to maintain.
