Regular Expression for Stripping Strings from Source Code

Question

I'm looking for a regular expression that replaces every string literal in source code with a constant value such as "string". It must also handle the escaping convention in which a quote character inside a string is written as two consecutive quote characters (e.g. "he said ""hello""").

To clarify, I will provide some examples of input and expected output:

input: print("hello world, how are you?")
output: print("string")

input: print("hello" + "world")
output: print("string" + "string")

# here's the tricky part:
input: print("He told her ""how you doin?"", and she said ""I'm fine, thanks""")
output: print("string")

I'm working in Python, but I guess this is language agnostic.

EDIT: According to one of the answers, this requirement may not be expressible as a regular expression. I'm not sure that's true, but I'm not an expert. Put into words, I want to find runs of characters enclosed in double quotes, where groups of an even number of adjacent double quotes inside the run are ignored. That sounds to me like something a DFA can recognize.
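
The wording above maps fairly directly onto a three-state machine. What follows is only a minimal sketch and not part of the original question: the function name `strip_strings_dfa`, the state names, and the `placeholder` parameter are invented for illustration, and it assumes double quotes are the only string delimiter in the input.

```python
# States of the hypothetical scanner.
OUTSIDE, INSIDE, QUOTE_SEEN = range(3)


def strip_strings_dfa(source, placeholder='"string"'):
    """Replace each double-quoted literal in source with placeholder.

    A doubled quote ("") inside a literal is treated as an escaped quote.
    """
    out = []
    state = OUTSIDE
    for ch in source:
        if state == OUTSIDE:
            if ch == '"':
                state = INSIDE          # opening quote: start swallowing
            else:
                out.append(ch)          # ordinary code: copy through
        elif state == INSIDE:
            if ch == '"':
                state = QUOTE_SEEN      # either the end or half of a ""
            # any other character belongs to the literal and is dropped
        else:  # QUOTE_SEEN
            if ch == '"':
                state = INSIDE          # it was a doubled quote: still inside
            else:
                out.append(placeholder)  # the previous quote closed the literal
                out.append(ch)
                state = OUTSIDE
    if state == QUOTE_SEEN:
        out.append(placeholder)         # literal closed at end of input
    # An unterminated literal (state == INSIDE here) is silently dropped.
    return "".join(out)
```

Because the machine reads one character at a time with a fixed number of states, it suggests that the requirement is regular. One consequence of the escaping rule is that two literals written back to back with no separator, such as "a""b", are indistinguishable from a single literal containing an escaped quote.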

Thanks.

Could you be more specific? I don't understand your question at all. Why would "hello" + "world" be different from "helloworld", for instance?
– odwl, Commented May 27, 2009 at 10:17
I'm parsing code. If the parser could figure out that "hello" + "world" is identical to "helloworld", that would be a nice bonus, but it's not a requirement. I hope that clarifies it.
– Roee Adler, Commented May 27, 2009 at 10:19

Carl Meyer · Accepted Answer · 2009-05-27 22:16:20Z

3

If you're parsing Python code, save yourself the hassle and let the standard library's parser module do the heavy lifting.
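
For the Python-source case, here is a hedged sketch of what leaning on the standard library could look like. It uses the `tokenize` module rather than the `parser` module mentioned above, because `parser` was removed in Python 3.10. The helper name `strip_python_strings` and its `placeholder` parameter are hypothetical. Note that Python itself does not use the doubled-quote escape from the question, so this only covers Python's own literal syntax.

```python
import io
import tokenize


def strip_python_strings(source, placeholder='"string"'):
    """Replace every Python string literal in source with placeholder."""
    # Absolute offset at which each line starts; tokenize rows are 1-based.
    line_starts = [0]
    for line in io.StringIO(source).readlines():
        line_starts.append(line_starts[-1] + len(line))

    pieces = []
    pos = 0
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        # Caveat: on Python 3.12+ f-strings are split into FSTRING_* tokens
        # and are therefore not replaced by this sketch.
        if tok.type == tokenize.STRING:
            start = line_starts[tok.start[0] - 1] + tok.start[1]
            end = line_starts[tok.end[0] - 1] + tok.end[1]
            pieces.append(source[pos:start])  # code before the literal
            pieces.append(placeholder)
            pos = end
    pieces.append(source[pos:])               # code after the last literal
    return "".join(pieces)
```

Slicing the original text by token position keeps the surrounding whitespace and comments untouched, which `tokenize.untokenize` does not guarantee once token lengths change.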

If you're writing your own parser for some custom language, it's awfully tempting to start out by just hacking together a bunch of regexes, but don't do it. You'll dig yourself into an unmaintainable mess. Read up on parsing techniques and do it right (Wikipedia can help).

This regex does the trick for all three of your examples:

re.sub(r'"(?:""|[^"])+"', '"string"', original)
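
To make the one-liner easier to try, here is a small self-contained sketch that runs it against the three examples from the question. The wrapper `strip_strings` and the constant names are invented for illustration. The second pattern is a variation that is not in the answer: it swaps `+` for `*` on the assumption that empty literals should be replaced as well, which the original pattern does not do.

```python
import re

# Pattern from the answer: an opening quote, then one or more items that are
# either a doubled quote ("") or any character other than a quote, then a
# closing quote.
STRING_RE = re.compile(r'"(?:""|[^"])+"')

# Variation (an assumption, not part of the answer): "*" instead of "+"
# lets the pattern match the empty literal "" too.
STRING_RE_ALLOW_EMPTY = re.compile(r'"(?:""|[^"])*"')


def strip_strings(source, pattern=STRING_RE):
    """Replace every double-quoted literal in source with "string"."""
    return pattern.sub('"string"', source)


if __name__ == "__main__":
    examples = [
        'print("hello world, how are you?")',
        'print("hello" + "world")',
        'print("He told her ""how you doin?"", and she said ""I\'m fine, thanks""")',
    ]
    for line in examples:
        print(strip_strings(line))
```

With the answer's pattern, `print("")` comes back unchanged because at least one character is required between the quotes. Passing `STRING_RE_ALLOW_EMPTY` as the second argument turns it into `print("string")`.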

edited May 27, 2009 at 22:16

answered May 27, 2009 at 15:04

Carl Meyer


5 Comments

Roee Adler Over a year ago

I'm not parsing Python and I'm aware of the challenges of parsing. I do not intend to parse using regexes, but only strip the strings before parsing to make my parsing simpler.

Carl Meyer Over a year ago

Fair enough, added a regex which I think does what you need.

Ben Blank Over a year ago

@Carl Meyer — For performance, I'd recommend using non-capturing groups and removing the first quantifier, to prevent the quantifiers from "fighting" in ambiguous cases: r'"(?:""|[^"])+"'

Carl Meyer Over a year ago

Very good points, @Ben Blank, thanks. Editing to include your suggestions.

John Machin Over a year ago

"strip the strings before parsing to make my parsing simpler"?? Your lexer will still need to recognise "string" as a string constant ... so why not handle all the varieties of string constant forms in your lexer? BTW, what about "embedded \" quote" ?

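On the last comment's question about "embedded \" quote": the thread never says whether the target language uses backslash escapes, so the following is only a sketch under that assumption. The pattern and the name `strip_strings_with_backslashes` are hypothetical and are not taken from any of the answers.

```python
import re

# Inside a literal, accept (in this order):
#   \\.     a backslash followed by any character (an escape such as \")
#   ""      a doubled quote, as in the original question
#   [^"\\]  any character that is neither a quote nor a backslash
# re.DOTALL lets "." also match a newline after a backslash.
ESCAPED_STRING_RE = re.compile(r'"(?:\\.|""|[^"\\])*"', re.DOTALL)


def strip_strings_with_backslashes(source):
    """Replace double-quoted literals, honoring both "" and backslash escapes."""
    return ESCAPED_STRING_RE.sub('"string"', source)
```

The three alternatives start with different characters, so the regex engine never has to choose between them and backtracking stays cheap. If the language does not treat backslash as an escape character, this pattern would mis-handle a literal that ends in a backslash, so the simpler pattern is the safer choice there.
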
Douglas Leeder · Answer · 2009-05-27 10:13:18Z

0

Maybe:

re.sub(r"[^\"]\"[^\"].*[^\"]\"[^\"]",'"string"',input)

EDIT:

No, that won't work for the final example.

I don't think your requirements are regular: they can't be matched by a regular expression. This is because at the heart of the matter, you need to match any odd number of " grouped together, as that is your delimiter.

I think you'll have to do it manually, counting "s.

answered May 27, 2009 at 10:13

Douglas Leeder


PAG · Answer · 2009-05-27 14:28:38Z

0

There's a very good string-matching regular expression over at ActiveState. If it doesn't work out of the box for your last example, it should be fairly trivial to add a repetition that groups adjacent quoted strings together.

answered May 27, 2009 at 14:28

PAG
