8

I'm working on a regular expression to recognize variable declarations in C and I have got this.

[a-zA-Z_][a-zA-Z0-9]*

Is there any better solution?

5
  • Sweet! It seems solid, just felt I might be missing something :P Commented Oct 20, 2012 at 22:26
  • 5
    That might match an identifier but not a declaration. Regular expressions are not going to be the best tool for parsing anything but the simplest declarations. Commented Oct 20, 2012 at 22:42
  • @JohnWoo I thought of something! What about length? What's the max length on a variable name? 32? How would that work? {0,31} something like that on the end? Commented Oct 20, 2012 at 22:43
  • @JohnWoo That's the ticket, thanks again. I think that's points worthy if you want to post as an answer. :P Commented Oct 20, 2012 at 22:47
  • @JohnWoo lol just edited my comment ;p Go for it! Commented Oct 20, 2012 at 22:49

7 Answers

12

Here is a pattern to recognize variable declarations in C. A conventional declaration looks like this:

int variable;

If that's the case, one should test for the type keyword before anything else, to avoid matching something unrelated, such as a string or a constant defined with the preprocessor.

(?:\w+\s+)([a-zA-Z_][a-zA-Z0-9]+)

The variable name is captured in \1.

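The type prefix sits in a non-capturing group, (?:...), so only the name is captured. If you need to assert the surrounding context without consuming it, the feature you need is look-behind/look-ahead.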
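As a quick check, here is a minimal Python sketch of the pattern above. Python's re module and the helper name declared_name are my assumptions for illustration; the pattern itself is copied unchanged.

```python
import re

# First pattern from the answer: a non-capturing "type" word, then the name in group 1.
DECL = re.compile(r"(?:\w+\s+)([a-zA-Z_][a-zA-Z0-9]+)")

def declared_name(line):
    """Return the text captured in group 1, or None when nothing matches."""
    m = DECL.search(line)
    return m.group(1) if m else None

print(declared_name("int variable;"))
print(declared_name("int i;"))        # the + quantifier needs at least two characters
print(declared_name("int my_var;"))   # the capture stops at the underscore
```

The last two calls show where this first version falls short: single-character names and names containing an underscore after the first character.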

UPDATE July 11 2015

The previous regex fails to match variables with _ anywhere after the first character. To fix that, one just has to add _ to the second character class of the capture group. It also assumed variable names of two or more characters, which is fixed by replacing + with *. This is how it looks after the fix:

(?:\w+\s+)([a-zA-Z_][a-zA-Z0-9_]*)

However, this regular expression has many false positives, goto jump; being one of them. Frankly, it's not suitable for the job. Because of that, I decided to create another regex that covers a wider range of cases. It's far from perfect, but here it is:

\b(?:(?:auto\s*|const\s*|unsigned\s*|signed\s*|register\s*|volatile\s*|static\s*|void\s*|short\s*|long\s*|char\s*|int\s*|float\s*|double\s*|_Bool\s*|complex\s*)+)(?:\s+\*?\*?\s*)([a-zA-Z_][a-zA-Z0-9_]*)\s*[\[;,=)]

I've tested this regex with Ruby, Python and JavaScript. It works very well for the common cases, but it fails in some others, listed below. The regex may also need some optimization, though that is hard to do while maintaining portability across several regex engines.

Test summary

unsignedchar *var;                   /* FAIL, matches +var+ (\s* lets two keywords run together) */
goto **label;                        /* OK, doesn't match */
int function();                      /* OK, doesn't match */
char **a_pointer_to_a_pointer;       /* OK, matches +a_pointer_to_a_pointer+ */
register unsigned char *variable;    /* OK, matches +variable+ */
long long factorial(int n)           /* OK, matches +n+ */
int main(int argc, int *argv[])      /* OK, matches +argc+ and +argv+ (needs two passes) */
const * char var;                    /* OK, matches +var+, however, it doesn't consider +const *+ as part of the declaration */
int i=0, j=0;                        /* 50%, matches +i+ but it will not match j after the first pass */
int (*functionPtr)(int,int);         /* FAIL, doesn't match (too complex) */
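Several of these cases can be reproduced with a short Python sketch. This is only an illustration under assumptions: it uses Python's re.findall (one of the three engines mentioned above), and the helper name variable_names is hypothetical. The pattern is copied unchanged, only split across lines.

```python
import re

# The wider pattern from the answer, copied unchanged (split over three raw strings).
DECL = re.compile(
    r"\b(?:(?:auto\s*|const\s*|unsigned\s*|signed\s*|register\s*|volatile\s*"
    r"|static\s*|void\s*|short\s*|long\s*|char\s*|int\s*|float\s*|double\s*"
    r"|_Bool\s*|complex\s*)+)(?:\s+\*?\*?\s*)([a-zA-Z_][a-zA-Z0-9_]*)\s*[\[;,=)]"
)

def variable_names(source):
    """Return every name captured by group 1 in a single left-to-right scan."""
    return DECL.findall(source)

cases = [
    "goto **label;",
    "int function();",
    "char **a_pointer_to_a_pointer;",
    "register unsigned char *variable;",
    "long long factorial(int n)",
    "int i=0, j=0;",
    "int (*functionPtr)(int,int);",
]
for line in cases:
    print(f"{line:40} {variable_names(line)}")
```

An empty list means the line did not match. The int i=0, j=0; case returns only i, because nothing before j looks like a type keyword.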

False positives

The following case is hard to cover with a portable regular expression. Text editors use contexts to avoid highlighting text inside quotes.

printf("int i=%d", i);               /* FAIL, matches +i+ inside quotes */
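One possible workaround, which is not part of the regex above and is only a sketch, is to blank out string literals before scanning. The STRING_LITERAL pattern and the blank_strings helper are hypothetical names of mine, and the sketch ignores comments and character literals.

```python
import re

# Assumption: a double-quoted C string literal with backslash escapes, on one line.
STRING_LITERAL = re.compile(r'"(?:\\.|[^"\\])*"')

# The wider declaration pattern from the answer, copied unchanged.
DECL = re.compile(
    r"\b(?:(?:auto\s*|const\s*|unsigned\s*|signed\s*|register\s*|volatile\s*"
    r"|static\s*|void\s*|short\s*|long\s*|char\s*|int\s*|float\s*|double\s*"
    r"|_Bool\s*|complex\s*)+)(?:\s+\*?\*?\s*)([a-zA-Z_][a-zA-Z0-9_]*)\s*[\[;,=)]"
)

def blank_strings(source):
    """Replace the contents of every string literal with an empty literal."""
    return STRING_LITERAL.sub('""', source)

line = 'printf("int i=%d", i);'
print(DECL.findall(line))                  # scanned as-is: false positive
print(DECL.findall(blank_strings(line)))   # scanned after blanking the literal
```

This is a crude stand-in for the context tracking that editors do, so treat it as a pre-filter rather than a real tokenizer.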

False positives (syntax errors)

This can be fixed if one tests the syntax of the source file before applying the regular expression. With GCC and Clang, one can just pass the -fsyntax-only flag to check the syntax of a source file without compiling it.

int char variable;                  /* matches +variable+ */

1 Comment

@themadmax You were right, thanks for letting me know about this. I updated the answer with a regular expression that covers a wider range of cases.

5

[a-zA-Z_][a-zA-Z0-9_]{0,31}

This also validates variable names such as "m_name". The {0,31} quantifier caps the name at 32 characters in total.
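A minimal Python sketch of this check, assuming the whole string must be an identifier (hence fullmatch); the helper name is_identifier is hypothetical:

```python
import re

# Pattern from the answer: one leading letter or underscore, then up to 31 more characters.
IDENT = re.compile(r"[a-zA-Z_][a-zA-Z0-9_]{0,31}")

def is_identifier(name):
    """True when the entire string fits the pattern (at most 32 characters)."""
    return IDENT.fullmatch(name) is not None

print(is_identifier("m_name"))     # underscore in the middle is accepted
print(is_identifier("9lives"))     # a leading digit is rejected
print(is_identifier("a" * 32))     # exactly at the length cap
print(is_identifier("a" * 33))     # one character over the cap
```

Note that the pattern does not exclude C keywords, so a name like int still passes this check.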

2

In my opinion, [a-zA-Z_][a-zA-Z0-9_]* should be the answer.

1

This eliminates the return and typedef false positives. It captures the type and the variable name, and supports pointers and arrays. It also skips commented-out code, which further reduces false positives, and it detects variables declared with typedef'd types.

^\s*(?!return |typedef )((\w+\s*\*?\s+)+)+(\w+)(\[\w*\])?(\s*=|;)
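A minimal Python sketch of how this pattern behaves when applied one line at a time. Python's re module and the helper name declared_name are assumptions of mine; the variable name is read from group 3.

```python
import re

# Pattern from the answer, copied unchanged; group 3 holds the variable name.
DECL = re.compile(r"^\s*(?!return |typedef )((\w+\s*\*?\s+)+)+(\w+)(\[\w*\])?(\s*=|;)")

def declared_name(line):
    """Return the variable name for a single source line, or None if it is not matched."""
    m = DECL.search(line)
    return m.group(3) if m else None

for line in [
    "int count = 0;",
    "char* name;",
    "int values[10];",
    "return count;",
    "typedef int myint;",
    "// int old = 1;",
]:
    print(f"{line:22} {declared_name(line)}")
```

The last three lines print None: the negative look-ahead rejects return and typedef, and the comment line fails because it does not start with a word character.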

1

I designed this regex for my assignment:

(_?[a-zA-Z0-9_]+)(\s+)(([a-zA-Z]?_[a-zA-Z])?([a-zA-Z]*[0-9]*_*)(=[a-zA-Z0-9]*)?[,]?)+((\s+)(([a-zA-Z]?_[a-zA-Z])?([a-zA-Z]*[0-9]*_*)(=[a-zA-Z0-9]*)?[,]?))*

It matches all declarations, including using namespace std, so you need to remove keywords before checking group 1 for the datatype. If the datatype is valid, you can remove the group 1 string and you will be left with only the comma-separated variables, including ones with assignment operators.

The following code is matched properly too:

int a=3, b=9, c, N

Even int i=0 is matched in a for loop: for(int i=0; i<N; ++i)

This regex does require more filtering work (like removing keywords before checking) but, in turn, it matches in cases where other regexes fail.

EDIT: Forgot to mention that it detects _ and all combinations of alphanumeric declarations too.

EDIT2: Slight modification to regex string:

([a-zA-Z0-9_]+)(\\s+)(([a-zA-Z_\\*]?[a-zA-Z0-9_]*(=[a-zA-Z0-9]+)?)[,;]?((\\s*)[a-zA-Z_\\*]?[a-zA-Z0-9_]*?(=[a-zA-Z0-9]+)?[,;])*)

This matches all variable and method declarations. So, all you need to do is check if reg_match->str(1) is a datatype or not. If yes, you can use sregex_token_iterator (with a regex separator of (\\s*)[,.;](\\s*)) on reg_match->str(3) to get all user defined identifiers.

0

This is an improved version of the answer by Zac Howard-Smith. It no longer consumes trailing whitespace, supports multiple variables separated by commas, and makes the type optional.

^[ \t]*(?!return|typedef)((\w+[ \t,]*\*?[ \t,]+)+)*(\w+)(\[\w*\])?([ \t,]*=|;)

-2

This is the complete form of a variable name.

It is just a small modification that lets you use multiple _ if you wish.

([a-zA-Z_][a-zA-Z0-9]*)*

1 Comment

Downvoted, because it also matches an empty string. This should be a comment, by the way: it does not answer the question because there is no explanation, just a "try this" suggestion.
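The empty-string point is easy to confirm with a minimal Python sketch (Python's re module and the helper name accepts are assumptions used for illustration):

```python
import re

# Pattern from the downvoted answer, copied unchanged.
PATTERN = re.compile(r"([a-zA-Z_][a-zA-Z0-9]*)*")

def accepts(text):
    """True when the whole string is matched by the pattern."""
    return PATTERN.fullmatch(text) is not None

print(accepts(""))        # the outer * allows zero repetitions
print(accepts("a_b"))     # matched as two repetitions: "a" then "_b"
print(accepts("1abc"))    # a leading digit is still rejected as a whole-string match
```

Because the outer group may repeat zero times, an unanchored search also succeeds on any input by matching nothing at the first position.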
