Here is a simple way of extracting all strings in a source file. There is one important decision to make first: do we preprocess the code? If not, we may miss strings that are generated via macros, and we would also have to treat # as a comment character.
As this is a quick-and-dirty solution, syntactic correctness of the C code is not an issue. We will however honour comments.
Now if the source was preprocessed (with gcc -E source.c), then multiline strings are already folded into one line! Also, comments are already removed. Sweet. The only comment-like lines that remain are line markers, which record line numbers and source files for debugging purposes. Basically, all we have to do is:
$ gcc -E source.c | perl -nE'
next if /^#/; # skip line directives etc.
say $1 while /(" (?:[^"\\]+ | \\.)* ")/xg;
'
Output (with the test file from my other answer as input):
""
"__isoc99_fscanf"
""
"__isoc99_scanf"
""
"__isoc99_sscanf"
""
"__isoc99_vfscanf"
""
"__isoc99_vscanf"
""
"__isoc99_vsscanf"
"Here, on the other hand, I've gone crazyand really let the literal span several lineswithout bothering with quoting each line'scontent. This works, but you can't indent"
"Hello %s:\n%s\n"
"World"
So yes, there is a lot of garbage here (it seems to come from __asm__ blocks), but this works astonishingly well.
Note the regex I used: /(" (?:[^"\\]+ | \\.)* ")/x. The pattern inside the capture can be explained as
"            # a literal '"'
(?:          # the beginning of a non-capturing group
  [^"\\]+    # a character class that matches anything but '"' or '\', repeated once or more
|
  \\.        # an escape sequence like '\n', '\"', '\\' ...
)*           # zero or more times
"            # closing '"'
What are the limitations of this solution?
- We need a preprocessor
  - This code was tested with gcc
  - clang also supports the -E option, but I have no idea how the output is formatted.
- Character literals are a failure mode, e.g. myfunc('"', a_variable, '"') would be extracted as "', a_variable, '".
- We also extract strings from other source files. (false positives)
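The character-literal failure mode can probably be reduced by matching character literals as well and simply not capturing them. This is only a sketch: $char_re and the extract_strings helper are names I made up here, and it assumes character literals use the same escape rules as strings, just with ' as the delimiter.

```perl
use strict; use warnings;
use feature 'say';

my $string_re = qr/" (?:[^"\\]+ | \\.)* "/x;
# Assumption: same structure as a string literal, but delimited by single quotes.
my $char_re   = qr/' (?:[^'\\]+ | \\.)* '/x;

# Hypothetical helper: returns all string literals found in one line of text.
sub extract_strings {
    my ($text) = @_;
    my @found;
    # A character literal is consumed without a capture, so the quote inside
    # it can no longer be mistaken for the start of a string.
    while ($text =~ /$char_re | ($string_re)/xg) {
        push @found, $1 if defined $1;
    }
    return @found;
}

# Prints only "real string"
say for extract_strings(q{myfunc('"', a_variable, '"', "real string")});
```

Apostrophes inside a string such as "it's" are still fine, because the match starts at the leftmost delimiter, which is the double quote.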
Oh wait, we can fix the false positives from other source files by parsing the line markers which the preprocessor inserted. They look like:
# 29 "/usr/include/stdio.h" 2 3 4
So if we remember the current filename and compare it to the filename we want, we can skip unwanted strings. This time, I'll write it as a full script instead of a one-liner.
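use strict; use warnings;
use autodie; # automatic error handling
use feature 'say';

my $source = shift @ARGV;
my $string_re = qr/" (?:[^"\\]+ | \\.)* "/x;

# open a pipe from the preprocessor
open my $preprocessed, "-|", "gcc", "-E", $source;

my $file = ''; # file the current line belongs to, quotes included; empty until the first line marker
while (<$preprocessed>) {
    # remember the filename whenever we see a line marker
    $file = $1 if /^\# \s+ \d+ \s+ ($string_re)/x;
    next if /^#/;                    # skip line markers and other directives
    next if $file ne qq("$source");  # skip everything that comes from other files
    say $1 while /($string_re)/g;
}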
Usage: $ perl extract-strings.pl source.c
This now produces the output:
"Here, on the other hand, I've gone crazyand really let the literal span several lineswithout bothering with quoting each line'scontent. This works, but you can't indent"
"Hello %s:\n%s\n"
"World"
If you cannot use the convenient preprocessor to fold multiline strings and remove comments, this gets a lot uglier, because we have to account for all of that ourselves. You want to slurp in the whole file at once rather than iterate over it line by line. Then you skip over any comments; do not forget to ignore preprocessor directives as well. After that, we can extract the strings as usual. Basically, you have to rewrite the grammar
Start → Comment Start
Start → String Start
Start → Whatever Start
Start → End
into a regex. As the grammar above describes a regular language, this isn't too hard.
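A minimal sketch of what such a regex could look like, under a few assumptions of mine: the whole source is already in one string, character literals are skipped too (the grammar above does not mention them), and oddities like trigraphs or comments inside directives are ignored. The names $token_re and extract_strings are hypothetical. The "Whatever" rule needs no alternative of its own, because the regex engine simply skips ahead to the next position where a token matches.

```perl
use strict; use warnings;
use feature 'say';

my $token_re = qr{
      /\* .*? \*/                      # block comment, may span lines
    | // [^\n]*                        # line comment
    | ^ [ \t]* \# (?: \\\n | [^\n] )*  # preprocessor directive, incl. continuation lines
    | ' (?: [^'\\]+ | \\. )* '         # character literal
    | ( " (?: [^"\\]+ | \\. )* " )     # string literal, the only captured token
}xms;

# Hypothetical helper: returns all string literals in a slurped source text.
sub extract_strings {
    my ($src) = @_;
    my @found;
    while ($src =~ /$token_re/g) {
        push @found, $1 if defined $1;   # comments etc. match but capture nothing
    }
    return @found;
}

# Made-up sample input; a real script would slurp the file instead.
my $src = <<'END_C';
#include "stdio.h"
/* a "commented" string */
int main(void) {
    char q = '"';              // not a "string"
    puts("Hello \"World\"");
    return 0;
}
END_C

# Prints only "Hello \"World\""
say for extract_strings($src);
```

Unlike the preprocessed variant, a string that uses a backslash-newline continuation is reported verbatim here, with the backslash and the line break still inside it.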