Finding all strings within source code that are not inside comments using Python

Question

I have a C like source code, and I am trying to extract and save all the strings in this source code to a list, without including the strings that are in comments. a string in this source code can have any characters, white spaces, and even comments.

Example:

// this is an inline comment with the name "Alpha 1"

string x = "Alpha 2";
/** this is a block comment with the string "Alpha 3" */
foo("Alpha 4");
string y = "Alpha /*  */ 5 // with comments";

Output:

["Alpha 2", "Alpha 4", "Alpha /*  */ 5 // with comments"]

The problem that I can't use regex because I can have comments within a given string (which is valid), and of course I can have strings inside an inline comment or inside a block comment.

I use this method to get all strings inside a code:

re.findall(r'\"(.+?)\"', code)

but It gives me also the strings that are inside comments.

Any Help?

Regular expressions are a terrible idea for parsing non-regular languages. Have you considered using an actual lexer/parser, e.g. yacc, lex, flex, bison? Yes, it's higher overhead to learn them, but it's the only way to parse an even moderately complex grammar correctly, without leaving edge cases. — ShadowRanger
– ShadowRanger, Commented Dec 22, 2017 at 18:34
You may need to use lookahead or lookbehind assertions. Try pythex.org for regex matching and tips. — user1329187
– user1329187, Commented Dec 22, 2017 at 18:41
@ShadowRanger: Since neither comments nor string constants are arbirtrarily nesting, the language is regular. I agree that using a trivial lexer + an explicit FSM to handle it is much easier than coming up with a regexp, especially if we don't want the entire file in memory as a string. — 9000
– 9000, Commented Dec 22, 2017 at 18:51

Robᵩ · Accepted Answer · 2017-12-22 19:27:43Z

2

If the language is as simple as you describe, I think I'd write the parser by hand. I'd still use a regular expression to tokenize the input.

Here you go:

import re
from itertools import takewhile


def extract_strings(source):
    def consume(it, end):
        return list(takewhile(lambda x: x != end, it))
    tokens = iter(re.split(r'''("|/\*|\*/|//|\n)''', source))
    strings = []
    for token in tokens:
        if token == '"':
            strings.append(''.join(consume(tokens, '"')))
        elif token == '//':
            consume(tokens, '\n')
        elif token == '/*':
            consume(tokens, '*/')
    return strings

data = '''
// this is an inline comment with the name "Alpha 1"

string x = "Alpha 2";
/** this is a block comment with the string "Alpha 3" */
foo("Alpha 4");
string y = "Alpha /*  */ 5 // with comments";
'''
print(extract_strings(data))

edited Dec 22, 2017 at 19:27

answered Dec 22, 2017 at 19:12

Robᵩ

170k20 gold badges251 silver badges323 bronze badges

Sign up to request clarification or add additional context in comments.

Comments

Ryan Schaefer · Accepted Answer · 2017-12-22 18:43:21Z

1

re = r'(?:\/\/.+\n|\/\*.+\*\/)|(\".+\"|\'.+\')'

This should mostly work. Just make just that comments that are like // comment end in a newline. All words that are not in a comment will be returned in capture group one. Be careful, however, there will be None for every comment in the code.

answered Dec 22, 2017 at 18:43

Ryan Schaefer

3,1101 gold badge29 silver badges48 bronze badges

Comments

dawg · Accepted Answer · 2017-12-22 20:18:02Z

1

Given:

>>> src='''\
... // this is an inline comment with the name "Alpha 1"
... 
... string x = "Alpha 2";
... /** this is a block comment with the string "Alpha 3" */
... foo("Alpha 4");
... string y = "Alpha /*  */ 5 // with comments";'''

This regex will work:

>>> pat=re.compile(r"(?:\/\/.+$|/\*[^*]*\*+(?:[^/*][^*]*\*+)*/)|(\"[^\"]*\"|\'[^']*\')", re.M)
>>> [m.group(1) for m in pat.finditer(src) if m.group(1)]
['"Alpha 2"', '"Alpha 4"', '"Alpha /*  */ 5 // with comments"']

The regex is explained here.

(But using a parser is more robust...)

answered Dec 22, 2017 at 20:18

dawg

105k24 gold badges142 silver badges217 bronze badges

Comments

Hai Vu · Accepted Answer · 2017-12-23 04:21:48Z

On the subject of writing a parser, I take this opportunity to write my own using state machine:

import sys


ASTERISK = '*'
DEFAULT = 'default'
EOL = '\n'
ESCAPE = '\\'
QUOTE = '"'
SLASH = '/'


class ExtractStrings:
    def __init__(self, multiline_string):
        self.buffer = multiline_string
        self.chars_collected = ''
        self.strings = None

    def noop(self, ch):
        pass

    def collect_char(self, ch):
        self.chars_collected += ch

    def return_string(self, ch):
        self.strings.append(self.chars_collected)
        self.chars_collected = ''

    def parse(self):
        self.strings = []
        state = {
            'start': {
                QUOTE: (self.noop, 'in_string'),
                SLASH: (self.noop, 'first_slash'),
                DEFAULT: (self.noop, 'start'),
            },
            'in_string': {
                QUOTE: (self.return_string, 'start'),
                ESCAPE: (self.collect_char, 'escaping'),
                DEFAULT: (self.collect_char, 'in_string'),
            },
            'escaping': {
                DEFAULT: (self.collect_char, 'in_string'),
            },
            'first_slash': {
                SLASH: (self.noop, 'line_comment'),
                ASTERISK: (self.noop, 'block_comment'),
                DEFAULT: (self.noop, 'start'),
            },
            'line_comment': {
                EOL: (self.noop, 'start'),
                DEFAULT: (self.noop, 'line_comment'),
            },
            'block_comment': {
                ASTERISK: (self.noop, 'near_comment_block_end'),
                DEFAULT: (self.noop, 'block_comment'),
            },
            'near_comment_block_end': {
                SLASH: (self.noop, 'start'),
                ASTERISK: (self.noop, 'near_comment_block_end'),
                DEFAULT: (self.noop, 'block_comment'),
            }

        }

        current = 'start'
        for ch in self.buffer:
            default = state[current][DEFAULT]
            action, next_state = state[current].get(ch, default)
            action(ch)
            current = next_state

    def __iter__(self):
        if self.strings is None:
            self.parse()

        return iter(self.strings)

if __name__ == '__main__':
    with open(sys.argv[1]) as f:
        code = f.read()

    for string_literal in ExtractStrings(code):
        print('"%s"' % string_literal)

How does it work?

The state machine defines different states, what to do at each state (not shown in the diagram) and the transitions to the next states. Once the state machine is defined (as a nested dictionary), it is just a matter of perform the action for the state, read the next char and look up the state machine to see which state we should transition to.

The state machine is a nested dictionary. For the outer dictionary, the key is the state name and the value is the inner dictionary. For the inner dictionary, the key is the next char and the value is a tuple of (action, next state).

Collectives™ on Stack Overflow

Finding all strings within source code that are not inside comments using Python

4 Answers 4

Comments

Comments

Comments

How does it work?

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

4 Answers 4

Comments

Comments

Comments

How does it work?

Comments

Your Answer

Sign up or log in

Post as a guest

Related