Find strings in source code using a regex in Perl

Question

I am studying regular expressions in Perl.

I want to write a script that accepts a C source code file and finds strings.

This is my code:

my $file1= @ARGV;
open my $fh1, '<', $file1;
while(<>)
{
  @words = split(/\s/, $_);
  $newMsg = join '', @words;
  push  @strings,($newMsg =~ m/"(.*\\*.*\\*.*\\*.*)"/) if($newMsg=~/".*\\*.*\\*.*\\*.*"/);
  print Dumper(\@strings);
foreach(@strings)
    {
    print"strings: $_\n"; 
    }

But I have a problem matching multi-line strings like this:

const char *text2 =
"Here, on the other hand, I've gone crazy\
and really let the literal span several lines\
without bothering with quoting each line's\
content. This works, but you can't indent";

What should I do?

Your first three lines do not work together. You assign a number to $file1 (the size of @ARGV), and your open probably fails silently, but because you do not check the return value you do not notice. Finally, you read from the file using the diamond operator, which automatically opens the file.
– TLP, Commented Aug 25, 2013 at 10:44
To overcome the multiline problem, slurp the file into a single variable.
– TLP, Commented Aug 25, 2013 at 10:46
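
A minimal sketch of what these two comments suggest, assuming the file name arrives as the first command-line argument: take it with shift instead of assigning @ARGV to a scalar, check the result of open, and slurp the whole file by localising $/. The helper name slurp_file is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return the whole file as one string.
sub slurp_file {
    my ($path) = @_;
    open my $fh, '<', $path
        or die "Cannot open '$path': $!";   # do not let open fail silently
    local $/;                               # no record separator: read everything at once
    my $content = <$fh>;
    close $fh;
    return $content;
}

# "my $file = @ARGV" would store the number of arguments; shift takes the first one.
if (defined(my $file = shift @ARGV)) {
    my $source = slurp_file($file);
    printf "read %d characters from %s\n", length $source, $file;
}
```

With the whole file in one scalar, a regex can see a literal that spans several physical lines, which line-by-line reading cannot.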

amon · Accepted Answer · 2013-08-25 13:17:01Z

Here is a fun solution. It uses MarpaX::Languages::C::AST, an experimental C parser. We can use the c2ast.pl program that ships with the module to convert a C source file to an abstract syntax tree, which we dump to a file (using Data::Dumper). We can then extract all strings with a bit of magic.

Unfortunately, the AST objects have no methods, but as they are autogenerated, we know how they look on the inside.

They are blessed arrayrefs.
- Some contain a single unblessed arrayref of items.
- Others contain zero or more items (lexemes or objects).

“Lexemes” are arrayrefs with two fields of location information, and the string contents at index 2.

This information can be extracted from the grammar.

The code:

use strict; use warnings;
use Scalar::Util 'blessed';
use feature 'say';

our $VAR1;
require "test.dump"; # populates $VAR1

my @strings = map extract_value($_), find_strings($$VAR1);
say for @strings;

sub find_strings {
  my $ast = shift;
  return $ast if $ast->isa("C::AST::string");
  return map find_strings($_), map flatten($_), @$ast;
}

sub flatten {
  my $thing = shift;
  return $thing if blessed($thing);
  return map flatten($_), @$thing if ref($thing) eq "ARRAY";
  return (); # we are not interested in other references, or unblessed data
}

sub extract_value {
  my $string = shift;
  return unless blessed($string->[0]);
  return unless $string->[0]->isa("C::AST::stringLiteral");
  return $string->[0][0][2];
}

A rewrite of find_strings from recursion to iteration:

sub find_strings {
  my @unvisited = @_;
  my @found;
  while (my $ast = shift @unvisited) {
    if ($ast->isa("C::AST::string")) {
      push @found, $ast;
    } else {
      push @unvisited, map flatten($_), @$ast;
    }
  }
  return @found;
}
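
To see how the traversal behaves without installing the parser, here is a sketch that runs the iterative version over a tiny hand-built tree. The tree only mimics the shape described above; the class names C::AST::Hypothetical and C::AST::HypotheticalRoot and the location numbers are invented for illustration and do not come from the real grammar.

```perl
use strict;
use warnings;
use Scalar::Util 'blessed';

sub flatten {
    my $thing = shift;
    return $thing if blessed($thing);                         # keep AST objects
    return map flatten($_), @$thing if ref($thing) eq 'ARRAY'; # unwrap plain arrayrefs
    return ();                                                # drop everything else
}

sub find_strings {
    my @unvisited = @_;
    my @found;
    while (my $ast = shift @unvisited) {
        if ($ast->isa('C::AST::string')) {
            push @found, $ast;
        } else {
            push @unvisited, map flatten($_), @$ast;
        }
    }
    return @found;
}

# Stand-in for the dumped AST: blessed arrayrefs wrapping lexemes,
# where a lexeme is [location, location, text].
my $literal = bless [ [ 0, 7, '"World"' ] ], 'C::AST::stringLiteral';
my $string  = bless [ $literal ],            'C::AST::string';
my $other   = bless [ [ 8, 1, ';' ] ],       'C::AST::Hypothetical';
my $root    = bless [ [ $string, $other ] ], 'C::AST::HypotheticalRoot';

# Same index path as extract_value: object -> literal -> lexeme -> text.
my @values = map { $_->[0][0][2] } find_strings($root);
print "$_\n" for @values;
```

Because isa works on any blessed reference, the mock classes need no package definitions of their own.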

The test C code:

/* A "comment" */
#include <stdio.h>

static const char *text2 =
"Here, on the other hand, I've gone crazy\
and really let the literal span several lines\
without bothering with quoting each line's\
content. This works, but you can't indent"; 

int main() {
        printf("Hello %s:\n%s\n", "World", text2);
        return 0;
}

I ran the commands

$ perl $(which c2ast.pl) test.c -dump >test.dump;
$ perl find-strings.pl

Which produced the output

"Here, on the other hand, I've gone crazyand really let the literal span several lineswithout bothering with quoting each line'scontent. This works, but you can't indent"
"World"
"Hello %s\n"
"" 
"" 
"" 
"" 
"" 
""

Notice that there are some empty strings that are not from our source code; they come from somewhere in the included files. Filtering those out would probably not be impossible, but is a bit impractical.

Your answer is beautiful, but I have to use regular expressions. If you have any ideas, I would be very glad if you shared them with me.

PP. · Accepted Answer · 2013-08-25 10:44:00Z

It appears you're trying to use the following regular expression to capture multiple lines in a string:

my $your_regexp = m{
    (
        .*  # anything
        \\* # any number of backslashes
        .*  # anything
        \\* # any number of backslashes
        .*  # anything
        \\* # any number of backslashes
        .*  # anything
    )
}x

But it appears more of a grasp of desperation than a deliberately thought out plan.

So you've got two problems:

- find everything between double quotes (")
- handle the situation where there might be multiple lines between those quotes

Regular expressions can match across multiple lines. The /s modifier lets . match newlines as well; a negated character class such as [^"] matches them regardless. So try:

my $your_new_regexp = m{
    \"       # opening quote mark
    ([^\"]+) # anything that's not a quote mark, capture
    \"       # closing quote mark
}xs;

You might actually have a 3rd problem:

- remove trailing backslash/newline pairs from strings

You could handle this by doing a search-replace:

foreach ( @strings ) {
    $_ =~ s/\\\n//g;
}
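
Putting the three steps together might look like the sketch below. It assumes the whole file has already been read into one string, since line-by-line reading cannot see a literal that spans lines. The function name and the sample text are illustrative only. It still relies on [^"]+, so a string containing an escaped quote (\") would be cut short.

```perl
use strict;
use warnings;

# Illustrative helper: $text is the complete source file in one string.
sub simple_strings {
    my ($text) = @_;
    # In list context, a /g match returns every captured group.
    # [^"] also matches newlines, so multi-line literals are found.
    my @strings = $text =~ /"([^"]+)"/g;
    # Remove the backslash-newline pairs that continue a literal.
    s/\\\n//g for @strings;
    return @strings;
}

my $sample = qq{const char *t =\n"first line\\\nsecond line";\nputs("hi");\n};
print "string: $_\n" for simple_strings($sample);
```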

edited Aug 25, 2013 at 10:44

answered Aug 25, 2013 at 10:38


7 Comments

User123422 Over a year ago

I am programming in Perl, so I added these changes to my code like this: push @strings,($newMsg =~ m/"([^\"]+)"/) if($newMsg=~/"([^\"]+)"/); but I still have a problem.

TLP Over a year ago

Your new regex should probably be qr() and not m(). Also /s will not help in line-by-line reading.

David Knipe Over a year ago

You've written push @strings,($newMsg =~ m/"([^\"]+)"/) if($newMsg=~/"([^\"]+)"/);. I think you want push @strings, $1 if($newMsg=~/"([^"]+)"/);. Running the regex just returns 1 if it finds a match; the captured string goes into the special variable $1.

User123422 Over a year ago

I changed my code like this: push @strings,($newMsg =~ qr/"([^\"]+)"/) if($newMsg=~/"([^\"]+)"/); but I still have a problem.

amon · Accepted Answer · 2013-08-26 09:20:24Z

Here is a simple way of extracting all strings in a source file. There is an important decision we can make: Do we preprocess the code? If not, we may miss some strings if they are generated via macros. We would also have to treat the # as a comment character.

As this is a quick-and-dirty solution, syntactic correctness of the C code is not an issue. We will however honour comments.

Now if the source was pre-processed (with gcc -E source.c), then multiline strings are already folded into one line! Also, comments are already removed. Sweet. The only comments that remain mention line numbers and source files for debugging purposes. Basically all that we have to do is

$ gcc -E source.c | perl -nE'
  next if /^#/;  # skip line directives etc.
  say $1 while /(" (?:[^"\\]+ | \\.)* ")/xg;
'

Output (with the test file from my other answer as input):

""
"__isoc99_fscanf"
""
"__isoc99_scanf"
""
"__isoc99_sscanf"
""
"__isoc99_vfscanf"
""
"__isoc99_vscanf"
""
"__isoc99_vsscanf"
"Here, on the other hand, I've gone crazyand really let the literal span several lineswithout bothering with quoting each line'scontent. This works, but you can't indent"
"Hello %s:\n%s\n"
"World"

So yes, there is a lot of garbage here (they seem to come from __asm__ blocks), but this works astonishingly well.

Note the regex I used: /(" (?:[^"\\]+ | \\.)* ")/x. The pattern inside the capture can be explained as

"         # a literal '"'
(?:       # the begin of a non-capturing group
  [^"\\]+ # a character class that matches anything but '"' or '\', repeated once or more
|
  \\.     # an escape sequence like '\n', '\"', '\\' ...
)*        # zero or more times
"         # closing '"'
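
To see the pattern at work, here is a small sketch that applies it to a single line containing escaped quotes. The helper name and the sample line are made up for illustration.

```perl
use strict;
use warnings;

my $string_re = qr/" (?:[^"\\]+ | \\.)* "/x;

# Return every string literal (quotes included) found in $code.
sub string_literals {
    my ($code) = @_;
    my @found;
    push @found, $1 while $code =~ /($string_re)/g;
    return @found;
}

# Single quotes keep the backslashes exactly as they would appear in C++ source.
my $line = 'cout << "sara comes from \"california\". " << "ok";';
print "$_\n" for string_literals($line);
```

The \\. branch consumes a backslash together with the character after it, so an escaped quote can never be mistaken for the closing quote.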

What are the limitations of this solution?

- We need a preprocessor.
  - This code was tested with gcc.
  - clang also supports the -E option, but I have no idea how the output is formatted.
- Character literals are a failure mode, e.g. myfunc('"', a_variable, '"') would be extracted as "', a_variable, '".
- We also extract strings from other source files (false positives).
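
One possible way around the character-literal failure mode, which is not part of the solution above, is to let the regex consume character literals too and keep only the matches in which the string-literal group took part. A sketch, assuming comments have already been removed (as in preprocessor output); the function name is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: string literals in $code, ignoring character literals.
sub strings_skipping_char_literals {
    my ($code) = @_;
    my @found;
    while ($code =~ /
          ' (?: [^'\\]+ | \\. )* '         # character literal: matched but not captured
        | ( " (?: [^"\\]+ | \\. )* " )     # string literal: captured in $1
    /xg) {
        push @found, $1 if defined $1;     # $1 is undef when the first branch matched
    }
    return @found;
}

print "$_\n" for strings_skipping_char_literals(q{myfunc('"', "real string", '"')});
```

Whichever literal starts first in the text wins, so a quote inside a character literal no longer opens a bogus string, and an apostrophe inside a string does not open a bogus character literal.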

Oh wait, we can fix the last bit by parsing the source file comments which the preprocessor inserted. They look like

# 29 "/usr/include/stdio.h" 2 3 4

So if we remember the current filename, and compare it to the filename we want, we can skip unwanted strings. This time, I'll write it as a full script instead of a one-liner.

use strict; use warnings;
use autodie;  # automatic error handling
use feature 'say';

my $source = shift @ARGV;
my $string_re = qr/" (?:[^"\\]+ | \\.)* "/x;

# open a pipe from the preprocessor
open my $preprocessed, "-|", "gcc", "-E", $source;

my $file;
while (<$preprocessed>) {
  $file = $1 if /^\# \s+ \d+ \s+ ($string_re)/x;
  next if /^#/;
  next if $file ne qq("$source");
  say $1 while /($string_re)/xg;
}

Usage: $ perl extract-strings.pl source.c

This now produces the output:

"Here, on the other hand, I've gone crazyand really let the literal span several lineswithout bothering with quoting each line'scontent. This works, but you can't indent"
"Hello %s:\n%s\n"
"World"

If you cannot use the convenient preprocessor to fold multiline strings and remove comments, this gets a lot uglier, because we have to account for all of that ourselves. Basically, you want to slurp in the whole file at once, not iterate it line by line. Then, you skip over any comments. Do not forget to ignore preprocessor directives as well. After that, we can extract the strings as usual. Basically, you have to rewrite the grammar

Start → Comment Start
Start → String Start
Start → Whatever Start
Start → End

to a regex. As the above is a regular language, this isn't too hard.
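
As a rough sketch of that idea (not a tested, complete C lexer): fold the backslash-newline pairs, then let a single alternation consume comments, preprocessor directives and character literals, capturing only string literals. The function name is hypothetical, and the sketch ignores trigraphs, C++ raw strings and the concatenation of adjacent literals. The character-literal branch is an addition to the grammar above.

```perl
use strict;
use warnings;

# Hypothetical helper: $src is the complete, unpreprocessed source text.
sub strings_without_preprocessor {
    my ($src) = @_;
    $src =~ s/\\\n//g;                     # fold backslash-newline continuations first
    my @found;
    while ($src =~ m{
          /\* .*? \*/                      # Comment: block comment
        | // [^\n]*                        # Comment: line comment
        | ^ [ \t]* \# [^\n]*               # preprocessor directive
        | ' (?: [^'\\]+ | \\. )* '         # character literal
        | ( " (?: [^"\\]+ | \\. )* " )     # String: the only captured branch
    }xmsg) {
        push @found, $1 if defined $1;     # "Whatever" is whatever the search skips over
    }
    return @found;
}

if (defined(my $path = shift @ARGV)) {
    open my $fh, '<', $path or die "Cannot open '$path': $!";
    my $src = do { local $/; <$fh> };      # slurp the whole file at once
    print "$_\n" for strings_without_preprocessor($src);
}
```

Since directives are skipped wholesale, strings that appear inside #include or #define lines are not reported, and strings produced by macro expansion are missed, as noted at the start of this answer.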

This regex: while($text=~m/"([^"]+)"/gx) works properly for multi-line strings and other cases, but I also want to match strings like: cout<<"sara comes frome \"california\". "; Do you have any idea?
@LoobiaSabz As you may have realized, I avoided the regex /"([^"]+)"/ precisely because of this case. Instead, I used /(" (?:[^"\\]+ | \\.)* ")/x, which handles it fine.
