I'm trying to extract variables from code statements and "if" conditions. I have a regex for that, but mymatcher.find() doesn't return any matches. I don't know what is wrong.

Here is my code:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class test {
    public static void main(String[] args) {
        String test="x=y+z/n-10+my5th_integer+201";
        Pattern mypattern = Pattern.compile("^[a-zA-Z_$][a-zA-Z_$0-9]*$");
        Matcher mymatcher = mypattern.matcher(test);    
        while (mymatcher.find()) {
            String find = mymatcher.group(1) ;
            System.out.println("variable:" + find);
        }
    }
}
  • The second character in the string is = but your regex does not allow for any =. Your regex also does not have any groups. Commented Oct 25, 2015 at 15:58
  • I don't know how '=' will affect the regex since I'm totally new to regex. I used .start() and .end(), but that's also not working. In this example, I expected x, y, z, n, and my5th_integer to be the result since they are variables. Commented Oct 25, 2015 at 16:03

2 Answers

You need to remove the ^ and $ anchors, which assert positions at the start and end of the string respectively, and use mymatcher.group(0) instead of mymatcher.group(1), because you do not have any capturing groups in your regex:

String test="x=y+z/n-10+my5th_integer+201";
Pattern mypattern = Pattern.compile("[a-zA-Z_$][a-zA-Z_$0-9]*");
Matcher mymatcher = mypattern.matcher(test);    
while (mymatcher.find()) {
    String find = mymatcher.group(0) ;
    System.out.println("variable:" + find);
}

See IDEONE demo, the results are:

variable:x
variable:y
variable:z
variable:n
variable:my5th_integer

Usually processing source code with just a regex simply fails.

If all you want to do is pick out identifiers (we discuss variables further below) you have some chance with regular expressions (after all, this is how lexers are built).

But you probably need a much more sophisticated version than what you have, even with corrections as suggested by other authors.

A first problem is that if you allow arbitrary statements, they often have keywords that look like identifiers. In your specific example, "if" looks like an identifier. So your matcher either has to recognize identifier-like substrings and subtract away known keywords, or the regex itself must express the idea that an identifier has a basic shape but cannot look like any of a specific list of keywords. (The latter is called a subtractive regex, and it isn't found in most regex engines. It looks something like:

 [a-zA-Z_$][a-zA-Z_$0-9]* - (if | else | class | ... )

Our DMS lexer generator [see my bio] has subtractive regex because this is extremely useful in language-lexing).
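The first option (match identifier-shaped substrings, then subtract known keywords) can be sketched in plain Java as below. This is only an illustration: the class name, the method name, and the short reserved-word list are assumptions made for the example, not a complete list for any language.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class KeywordFilterSketch {
    // Illustrative, incomplete list of reserved words (an assumption for this sketch).
    static final Set<String> RESERVED = Set.of("if", "else", "class", "true", "false");

    // Same ASCII-only identifier shape as in the question, without the anchors.
    static final Pattern IDENTIFIER = Pattern.compile("[a-zA-Z_$][a-zA-Z_$0-9]*");

    // Returns identifier-shaped substrings that are not reserved words, in order of appearance.
    static List<String> identifiers(String code) {
        List<String> found = new ArrayList<>();
        Matcher m = IDENTIFIER.matcher(code);
        while (m.find()) {
            String candidate = m.group();
            if (!RESERVED.contains(candidate)) {
                found.add(candidate);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(identifiers("if (x > limit) y = x; else y = 0;"));
    }
}
```

For the sample statement this yields x, limit, y, x, y, with "if" and "else" filtered out. It still treats context-dependent keywords as fixed, which is the limitation discussed next.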

This gets more complex if the "keywords" are not always keywords, that is, they can be keywords only in certain contexts. The Java "keyword" enum is just that: if you use it in a type context, it is a keyword; otherwise it is an identifier; C# is similar. Now the only way to know if a purported identifier is a keyword is to actually parse the code (which is how you detect the context that controls its keyword-ness).

Next, identifiers in Java allow a variety of Unicode characters (Latin1, Russian, Chinese, ...). A regexp to recognize this, accounting for all the characters, is a lot bigger than the simple "A-Z" style you propose.
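In Java's own regex engine you do not have to spell out those character ranges by hand, because java.util.regex.Pattern exposes the Character.isJavaIdentifierStart and Character.isJavaIdentifierPart tests as character classes. A minimal sketch follows; the class and method names are made up for the example, and the sample string is written with Unicode escapes only so the file compiles regardless of source encoding.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnicodeIdentifierSketch {
    // \p{javaJavaIdentifierStart} and \p{javaJavaIdentifierPart} delegate to
    // Character.isJavaIdentifierStart and Character.isJavaIdentifierPart.
    static final Pattern IDENTIFIER =
            Pattern.compile("\\p{javaJavaIdentifierStart}\\p{javaJavaIdentifierPart}*");

    // Returns every identifier-shaped substring, including non-ASCII names.
    static List<String> identifiers(String code) {
        List<String> found = new ArrayList<>();
        Matcher m = IDENTIFIER.matcher(code);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // A German name with two non-ASCII letters, and a Cyrillic name.
        String code = "gr\u00f6\u00dfe = \u0434\u043b\u0438\u043d\u0430 * 2";
        System.out.println(identifiers(code));
    }
}
```

This only fixes the shape of an identifier. Keywords, strings, comments, and escapes are still unhandled.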

For Java, you need to defend against string literals containing what appear to be variable names. Consider the (funny-looking but valid) statement:

a =  "x=y+z/n-10+my5th_integer+201";

There is only one identifier here. A similar problem occurs with comments that contain content that looks like statements:

/* Tricky:
   a =  "x=y+z/n-10+my5th_integer+201";
*/
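One common way to defend against both cases with a single pattern is to let the regex also match whole string literals and comments, and report a match only when the identifier alternative fired. The sketch below assumes simple double-quoted strings with backslash escapes plus block and line comments; the class and method names are hypothetical, and character literals and text blocks are deliberately not handled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SkipStringsAndCommentsSketch {
    static final Pattern TOKEN = Pattern.compile(
            "\"(?:\\\\.|[^\"\\\\])*\""          // string literal with backslash escapes
            + "|/\\*.*?\\*/"                    // block comment (DOTALL lets it span lines)
            + "|//[^\\n]*"                      // line comment
            + "|([a-zA-Z_$][a-zA-Z_$0-9]*)",    // group 1: identifier-shaped text
            Pattern.DOTALL);

    // Returns identifier-shaped text found outside string literals and comments.
    static List<String> identifiers(String code) {
        List<String> found = new ArrayList<>();
        Matcher m = TOKEN.matcher(code);
        while (m.find()) {
            // Group 1 is null when the match was a string literal or a comment.
            if (m.group(1) != null) {
                found.add(m.group(1));
            }
        }
        return found;
    }

    public static void main(String[] args) {
        String code = "/* Tricky:\n   a = \"x=y\";\n*/\nb = \"c+d\" + e; // f = g";
        System.out.println(identifiers(code));
    }
}
```

For the sample input only b and e are reported. This is already a small hand-written lexer, which is the point: the more cases you cover, the less it resembles a single simple regex.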

For Java, you need to worry about Unicode escapes, too. Consider this valid Java statement:

\u0061 = \u0062; //  means  "a=b;"

or nastier:

a\u006bc = 1; //  means "akc=1;" not "abc=1;"!

Pushing this, without Unicode character decoding, you might not even notice a string. The following is a variant of the above:

a =  \u0022x=y+z/n-10+my5th_integer+201";
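A rough way to cope is to decode the escapes before any matching is done. The sketch below is a simplification under stated assumptions: it replaces every backslash followed by one or more 'u' characters and four hex digits, and it ignores the Java rule that the backslash only counts when preceded by an even number of backslashes. The class and method names are invented for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnicodeEscapeSketch {
    // A backslash, one or more 'u' characters, then exactly four hex digits.
    static final Pattern ESCAPE = Pattern.compile("\\\\u+([0-9a-fA-F]{4})");

    // Replaces each escape with the character it denotes.
    // Simplification: does not check how many backslashes precede the escape.
    static String decode(String source) {
        Matcher m = ESCAPE.matcher(source);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            char decoded = (char) Integer.parseInt(m.group(1), 16);
            // quoteReplacement keeps a decoded '$' or backslash from being
            // interpreted as replacement syntax.
            m.appendReplacement(out, Matcher.quoteReplacement(String.valueOf(decoded)));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // The doubled backslashes put a literal backslash into the test input.
        System.out.println(decode("\\u0061 = \\u0062;"));
        System.out.println(decode("a\\u006bc = 1;"));
    }
}
```

The decoded text can then be handed to whatever identifier matcher you use, so an escaped quote or letter is seen as the character it really is.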

To extract identifiers correctly, you need to build (or use) the equivalent of a full Java lexer, not just a simple regex match.

If you don't need to be right all of the time, you can try your regex. Usually regex-applied-to-source-code-parsing ends badly, partly because of the above problems (e.g., oversimplification).

You are lucky in that you are trying to do this for Java. If you had to do this for C#, a very similar language, you'd have to handle interpolated strings, which allow expressions inside strings. The expressions themselves can contain strings... it's turtles all the way down. Consider the C# (version 6) statement:

a  = $"x+{y*$"z=${c /* p=q */}"[2]}*q" + b;

This contains the identifiers a, b, c and y. Every other "identifier" is actually just a string or comment character. PHP has similar interpolated strings.

To extract identifiers from this, you need something that understands the nesting of string elements. Lexers usually don't do recursion (our DMS lexers handle this, for precisely this reason), so to process this correctly you usually need a parser, or at least something that tracks nesting.

You have one other issue: do you want to extract just variable names? What if the identifier represents a method, type, class or package? You can't figure this out without having a full parser and full Java name and type resolution, and you have to do this in the context in which the statement is found. You'd be amazed how much code it takes to do this right.

So, if your goals are simpleminded and you don't care if it handles these complications, you can get by with a simple regex to pick out things that look like identifiers.

If you want to do it well (e.g., use this in some production code), the single regex will be a total disaster. You'll spend your life explaining to users what they cannot type, and that never works.

Summary: because of all the complications, usually processing source code with just a regex simply fails. People keep re-learning this lesson. It is one of the key reasons that lexer generators are widely used in language processing tools.

4 Comments

What a nice explanation, and you're definitely right, but for my purpose I just need the above regex, since I'm working on data flow analysis on pseudo code and I just need to extract variable names. The only keywords I'd care about are if, else, elseif, true, and false, so I can easily handle them.
Well, you did say Java :-} If you want to do data flow analysis (not clear on how that is useful in pseudocode), you still need to parse and do name resolution. See my essay on "Life After Parsing" (Google or via bio). Fundamentally we're back to "you intend to build a toy (for educational purposes)", which means any cheat is OK as long as you understand you are cheating, or "you want to build a serious tool", at which point you will need to build/use serious infrastructure.
Well, I guess I'll need to explain the whole research project so that it can be clear to you. It's, as I mentioned, implemented in Java and involves passing pseudo code as a string at some point, and I assure you, dude, what I posted is just what I needed.
Generally pseudocode is successful precisely because it has somewhat vague semantics; it is hard to do accurate dataflow on such. And I've never seen anybody do serious dataflow directly on text strings; you need all the code structure. Yes, it seems exceedingly strange. Yes, I'd like to know what the research project is; I can always learn something new.
