Could someone vote up my own answer to my own question?
Thanks to Martin York's idea, I found that in Visual Studio the solution looks very simple (subject to further testing). Just rename ALL preprocessor directives to something else (anything will do, even something that is not valid C++ syntax) and run cl.exe with /P:
cl target.cpp /P
This produces target.i, which contains the source minus the comments. Just rename the directives back and there you go. You will probably also need to remove the #line directives generated by cl.exe.
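The renaming and restoring can be scripted. Below is a minimal Python sketch, assuming the directives are disguised by prefixing them with "-" as in the example further down. The function names and the marker are my own choices, not anything cl.exe requires, and the sketch is purely line-based: it does not handle directives written with digraphs or trigraphs.

```python
import re

# Assumed marker; any prefix that cl.exe will not treat as a directive should do.
MARK = "-"

# A directive is a line whose first non-blank character is '#'.
_DIRECTIVE = re.compile(r"^([ \t]*)#", re.MULTILINE)
_MASKED = re.compile(r"^([ \t]*)" + re.escape(MARK) + "#", re.MULTILINE)
# After masking, the only real '#line' lines left are the ones cl.exe generated.
_CL_LINE = re.compile(r"^[ \t]*#line\b.*\n?", re.MULTILINE)


def mask_directives(source):
    """Disguise every directive so the preprocessor ignores it."""
    return _DIRECTIVE.sub(lambda m: m.group(1) + MARK + "#", source)


def strip_line_directives(preprocessed):
    """Drop the #line lines that cl.exe /P adds to its output."""
    return _CL_LINE.sub("", preprocessed)


def unmask_directives(preprocessed):
    """Turn the disguised directives back into real ones."""
    return _MASKED.sub(lambda m: m.group(1) + "#", preprocessed)
```

The order matters: run mask_directives before cl.exe, then strip_line_directives on the .i file, and only then unmask_directives. Otherwise an original #line directive in the source would be deleted along with the generated ones.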
This works because, according to MSDN, the phases of translation are as follows:
Character mapping
Characters in the source file are mapped to the internal source representation. Trigraph sequences are converted to single-character internal representation in this phase.
Line splicing
All lines ending in a backslash (\) and immediately followed by a newline character are joined with the next line in the source file forming logical lines from the physical lines. Unless it is empty, a source file must end in a newline character that is not preceded by a backslash.
Tokenization
The source file is broken into preprocessing tokens and white-space characters. Comments in the source file are replaced with one space character each. Newline characters are retained.
Preprocessing
Preprocessing directives are executed and macros are expanded into the source file. The #include statement invokes translation starting with the preceding three translation steps on any included text.
Character-set mapping
All source character set members and escape sequences are converted to their equivalents in the execution character set. For Microsoft C and C++, both the source and the execution character sets are ASCII.
String concatenation
All adjacent string and wide-string literals are concatenated. For example, "String " "concatenation" becomes "String concatenation".
Translation
All tokens are analyzed syntactically and semantically; these tokens are converted into object code.
Linkage
All external references are resolved to create an executable program or a dynamic-link library.
Comments are removed during Tokenization, before the Preprocessing phase. So just make sure that nothing is available for processing during the preprocessing phase (by disguising all the directives), and the output should be just what the first three phases produce.
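To make the ordering concrete, here is a rough Python sketch of phases 2 and 3 as described above: splice the lines first, then replace each comment with one space while leaving string and character literals alone. It only illustrates the idea and is not what cl.exe does internally. The function names are mine, and it ignores trigraphs, raw string literals, and C++14 digit separators.

```python
def splice_lines(source):
    """Phase 2: join every backslash-newline pair into one logical line."""
    return source.replace("\\\n", "")


def strip_comments(source):
    """Phase 3 (comments only): replace each comment with a single space."""
    out = []
    i, n = 0, len(source)
    while i < n:
        ch = source[i]
        two = source[i:i + 2]
        if two == "//":
            # Line comment: runs up to, but not including, the newline.
            j = source.find("\n", i)
            i = n if j == -1 else j
            out.append(" ")
        elif two == "/*":
            # Block comment: runs through the closing */.
            j = source.find("*/", i + 2)
            i = n if j == -1 else j + 2
            out.append(" ")
        elif ch in "\"'":
            # String or character literal: copy verbatim, honouring escapes.
            j = i + 1
            while j < n and source[j] != ch:
                if source[j] == "\\":
                    j += 1  # skip the escaped character
                j += 1
            out.append(source[i:j + 1])
            i = j + 1
        else:
            out.append(ch)
            i += 1
    return "".join(out)
```

Because splicing happens first, a // comment ending in a backslash swallows the following physical line, which is exactly the "comment terminated by \" case in the example below.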
As for the user-defined .h files, use the /FI option to include them manually. The resulting .i file will be a combination of the .cpp and the .h files, without comments. Each piece is preceded by a #line with the proper filename, so it is easy to split them up in an editor. If we don't want to split them up manually, we probably need to use the macro/scripting facility of some editor to do it automatically.
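If no editor macro is at hand, the splitting can also be done with a short script. This is a minimal Python sketch, assuming each piece starts with a line of the form #line N "filename" and that backslashes in the filename are doubled in the .i file (worth checking against your own output). The function name is hypothetical.

```python
import re

# Matches e.g.:  #line 1 "vc8.cpp"
_LINE = re.compile(r'^[ \t]*#line[ \t]+\d+[ \t]+"(.*)"[ \t]*$')


def split_by_line_directives(preprocessed):
    """Group the text of a .i file by the filename in each #line directive."""
    pieces = {}
    current = None
    for line in preprocessed.splitlines(keepends=True):
        m = _LINE.match(line.rstrip("\r\n"))
        if m:
            # Assumption: paths are written with doubled backslashes.
            current = m.group(1).replace("\\\\", "\\")
            pieces.setdefault(current, [])
        elif current is not None:
            pieces[current].append(line)
    return {name: "".join(chunks) for name, chunks in pieces.items()}
```

Each value in the returned dictionary can then be written back to its own file. Text that shows up under the same filename more than once is concatenated in order of appearance.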
So now we don't have to care about any of the preprocessor directives. Even better, the line continuation character (backslash) is handled too.
For example:
// vc8.cpp : Defines the entry point for the console application.
//
-#include "stdafx.h"
-#include <windows.h>
-#define NOERR
-#ifdef NOERR
/* comment here */
whatever error line is ok
-#else
some error line if NOERR not defined
// comment here
-#endif
void pr() ;
int _tmain(int argc, _TCHAR* argv[])
{
    pr();
    return 0;
}
/*comment*/
void pr() {
    printf(" /* "); /* comment inside string " */
    // comment terminated by \
continue a comment line
    printf(" "); /** " " string inside comment */
    printf/* this is valid comment within line continuation */\
("some weird lines \
with line continuation");
}
After cl.exe vc8.cpp /P, it becomes the following, which can then be fed to cl.exe again after restoring the directives (and removing the #line):
#line 1 "vc8.cpp"
-#include "stdafx.h"
-#include <windows.h>
-#define NOERR
-#ifdef NOERR
whatever error line is ok
-#else
some error line if NOERR not defined
-#endif
void pr() ;
int _tmain(int argc, _TCHAR* argv[])
{
    pr();
    return 0;
}
void pr() {
    printf(" /* ");
    printf(" ");
    printf\
("some weird lines \
with line continuation");
}