2

I have a C like source code, and I am trying to extract and save all the strings in this source code to a list, without including the strings that are in comments. a string in this source code can have any characters, white spaces, and even comments.

Example:

// this is an inline comment with the name "Alpha 1"

string x = "Alpha 2";
/** this is a block comment with the string "Alpha 3" */
foo("Alpha 4");
string y = "Alpha /*  */ 5 // with comments";

Output:

["Alpha 2", "Alpha 4", "Alpha /*  */ 5 // with comments"]

The problem that I can't use regex because I can have comments within a given string (which is valid), and of course I can have strings inside an inline comment or inside a block comment.

I use this method to get all strings inside a code:

re.findall(r'\"(.+?)\"', code)

but It gives me also the strings that are inside comments.

Any Help?

3
  • 2
    Regular expressions are a terrible idea for parsing non-regular languages. Have you considered using an actual lexer/parser, e.g. yacc, lex, flex, bison? Yes, it's higher overhead to learn them, but it's the only way to parse an even moderately complex grammar correctly, without leaving edge cases. Commented Dec 22, 2017 at 18:34
  • 1
    You may need to use lookahead or lookbehind assertions. Try pythex.org for regex matching and tips. Commented Dec 22, 2017 at 18:41
  • @ShadowRanger: Since neither comments nor string constants are arbirtrarily nesting, the language is regular. I agree that using a trivial lexer + an explicit FSM to handle it is much easier than coming up with a regexp, especially if we don't want the entire file in memory as a string. Commented Dec 22, 2017 at 18:51

4 Answers 4

2

If the language is as simple as you describe, I think I'd write the parser by hand. I'd still use a regular expression to tokenize the input.

Here you go:

import re
from itertools import takewhile


def extract_strings(source):
    def consume(it, end):
        return list(takewhile(lambda x: x != end, it))
    tokens = iter(re.split(r'''("|/\*|\*/|//|\n)''', source))
    strings = []
    for token in tokens:
        if token == '"':
            strings.append(''.join(consume(tokens, '"')))
        elif token == '//':
            consume(tokens, '\n')
        elif token == '/*':
            consume(tokens, '*/')
    return strings

data = '''
// this is an inline comment with the name "Alpha 1"

string x = "Alpha 2";
/** this is a block comment with the string "Alpha 3" */
foo("Alpha 4");
string y = "Alpha /*  */ 5 // with comments";
'''
print(extract_strings(data))
Sign up to request clarification or add additional context in comments.

Comments

1
re = r'(?:\/\/.+\n|\/\*.+\*\/)|(\".+\"|\'.+\')'

This should mostly work. Just make just that comments that are like // comment end in a newline. All words that are not in a comment will be returned in capture group one. Be careful, however, there will be None for every comment in the code.

Comments

1

Given:

>>> src='''\
... // this is an inline comment with the name "Alpha 1"
... 
... string x = "Alpha 2";
... /** this is a block comment with the string "Alpha 3" */
... foo("Alpha 4");
... string y = "Alpha /*  */ 5 // with comments";'''

This regex will work:

>>> pat=re.compile(r"(?:\/\/.+$|/\*[^*]*\*+(?:[^/*][^*]*\*+)*/)|(\"[^\"]*\"|\'[^']*\')", re.M)
>>> [m.group(1) for m in pat.finditer(src) if m.group(1)]
['"Alpha 2"', '"Alpha 4"', '"Alpha /*  */ 5 // with comments"']

The regex is explained here.

(But using a parser is more robust...)

Comments

0

On the subject of writing a parser, I take this opportunity to write my own using state machine:

State Machine for Extracting Strings

import sys


ASTERISK = '*'
DEFAULT = 'default'
EOL = '\n'
ESCAPE = '\\'
QUOTE = '"'
SLASH = '/'


class ExtractStrings:
    def __init__(self, multiline_string):
        self.buffer = multiline_string
        self.chars_collected = ''
        self.strings = None

    def noop(self, ch):
        pass

    def collect_char(self, ch):
        self.chars_collected += ch

    def return_string(self, ch):
        self.strings.append(self.chars_collected)
        self.chars_collected = ''

    def parse(self):
        self.strings = []
        state = {
            'start': {
                QUOTE: (self.noop, 'in_string'),
                SLASH: (self.noop, 'first_slash'),
                DEFAULT: (self.noop, 'start'),
            },
            'in_string': {
                QUOTE: (self.return_string, 'start'),
                ESCAPE: (self.collect_char, 'escaping'),
                DEFAULT: (self.collect_char, 'in_string'),
            },
            'escaping': {
                DEFAULT: (self.collect_char, 'in_string'),
            },
            'first_slash': {
                SLASH: (self.noop, 'line_comment'),
                ASTERISK: (self.noop, 'block_comment'),
                DEFAULT: (self.noop, 'start'),
            },
            'line_comment': {
                EOL: (self.noop, 'start'),
                DEFAULT: (self.noop, 'line_comment'),
            },
            'block_comment': {
                ASTERISK: (self.noop, 'near_comment_block_end'),
                DEFAULT: (self.noop, 'block_comment'),
            },
            'near_comment_block_end': {
                SLASH: (self.noop, 'start'),
                ASTERISK: (self.noop, 'near_comment_block_end'),
                DEFAULT: (self.noop, 'block_comment'),
            }

        }

        current = 'start'
        for ch in self.buffer:
            default = state[current][DEFAULT]
            action, next_state = state[current].get(ch, default)
            action(ch)
            current = next_state

    def __iter__(self):
        if self.strings is None:
            self.parse()

        return iter(self.strings)

if __name__ == '__main__':
    with open(sys.argv[1]) as f:
        code = f.read()

    for string_literal in ExtractStrings(code):
        print('"%s"' % string_literal)

How does it work?

The state machine defines different states, what to do at each state (not shown in the diagram) and the transitions to the next states. Once the state machine is defined (as a nested dictionary), it is just a matter of perform the action for the state, read the next char and look up the state machine to see which state we should transition to.

The state machine is a nested dictionary. For the outer dictionary, the key is the state name and the value is the inner dictionary. For the inner dictionary, the key is the next char and the value is a tuple of (action, next state).

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.