I'm looking for a regular expression that replaces every string literal in input source code with a constant string value such as "string". It must also handle escaping of the string-start character, which is written as a doubled string-start character (e.g. "he said ""hello""").

To clarify, I will provide some examples of input and expected output:

input: print("hello world, how are you?")
output: print("string")

input: print("hello" + "world")
output: print("string" + "string")

# here's the tricky part:
input: print("He told her ""how you doin?"", and she said ""I'm fine, thanks""")
output: print("string")

I'm working in Python, but I guess this is language agnostic.

EDIT: According to one of the answers, this requirement may not be a good fit for a regular expression. I'm not sure that's true, but I'm not an expert. Put into words, I'm looking for runs of characters between double quotes, where even-length groups of adjacent double quotes inside the string are disregarded, and that sounds to me like something a DFA can recognize.
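To make the DFA intuition concrete, here is a minimal sketch of a three-state scanner. The function name `replace_strings`, the state names, and the handling of an unterminated literal are all assumptions made for illustration, not part of any answer below.

```python
def replace_strings(source, placeholder='"string"'):
    """Replace each double-quoted literal with a placeholder (illustrative sketch)."""
    OUTSIDE, INSIDE, QUOTE_SEEN = range(3)
    state = OUTSIDE
    out = []
    for ch in source:
        if state == OUTSIDE:
            if ch == '"':
                state = INSIDE          # opening quote
            else:
                out.append(ch)          # ordinary code is copied through
        elif state == INSIDE:
            if ch == '"':
                state = QUOTE_SEEN      # closing quote, or first half of ""
            # any other character is string content and is dropped
        else:  # QUOTE_SEEN
            if ch == '"':
                state = INSIDE          # "" was an escaped quote; still inside
            else:
                out.append(placeholder) # the previous quote closed the literal
                out.append(ch)
                state = OUTSIDE
    if state == QUOTE_SEEN:
        out.append(placeholder)         # literal closed exactly at end of input
    # Assumption: an unterminated literal (state INSIDE here) is silently dropped.
    return "".join(out)


print(replace_strings('print("hello" + "world")'))
```

Since the scanner needs only a fixed number of states and no counter, it suggests the language of such literals is regular.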

Thanks.

2
  • Could you be more specific? I don't understand your question at all. Why would "hello" + "world" be different from "helloworld", for instance? Commented May 27, 2009 at 10:17
  • I'm parsing code. If the parser could figure out that "hello"+"world" is identical to "helloworld" that would be a nice bonus, but not a requirement. I hope that clarifies it. Commented May 27, 2009 at 10:19

3 Answers

3

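If you're parsing Python code, save yourself the hassle and let the standard library do the heavy lifting: the parser module at the time of writing, or the tokenize and ast modules in current Python, since parser was removed in Python 3.10.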

If you're writing your own parser for some custom language, it's awfully tempting to start out by just hacking together a bunch of regexes, but don't do it. You'll dig yourself into an unmaintainable mess. Read up on parsing techniques and do it right (Wikipedia can help).

This regex does the trick for all three of your examples:

import re
stripped = re.sub(r'"(?:""|[^"])+"', '"string"', original)
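As a quick check, the sketch below runs that pattern over the three examples from the question. The helper name `strip_strings` and the `examples` list are illustrative. The second pattern, which uses `*` instead of `+`, is a hypothetical variant and not part of the answer; it is included only because the `+` form does not match an empty literal such as `""`.

```python
import re

# Pattern from the answer: a quote, one or more of (doubled quote | non-quote), a quote.
ANSWER_PATTERN = re.compile(r'"(?:""|[^"])+"')
# Hypothetical variant: '*' instead of '+', so the empty literal "" also matches.
VARIANT_PATTERN = re.compile(r'"(?:""|[^"])*"')


def strip_strings(source, pattern=ANSWER_PATTERN):
    """Replace every match of the pattern with the constant "string"."""
    return pattern.sub('"string"', source)


examples = [
    'print("hello world, how are you?")',
    'print("hello" + "world")',
    'print("He told her ""how you doin?"", and she said ""I\'m fine, thanks""")',
]

for example in examples:
    print(strip_strings(example))
```

The alternation tries `""` before `[^"]`, so a doubled quote is always consumed as a pair and only an unpaired quote can close the match.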

5 Comments

I'm not parsing Python and I'm aware of the challenges of parsing. I do not intend to parse using regexes, but only strip the strings before parsing to make my parsing simpler.
Fair enough, added a regex which I think does what you need.
@Carl Meyer — For performance, I'd recommend using non-capturing groups and removing the first quantifier, to prevent the quantifiers from "fighting" in ambiguous cases: r'"(?:""|[^"])+"'
Very good points, @Ben Blank, thanks. Editing to include your suggestions.
"strip the strings before parsing to make my parsing simpler"?? Your lexer will still need to recognise "string" as a string constant ... so why not handle all the varieties of string constant forms in your lexer? BTW, what about "embedded \" quote" ?
0

Maybe:

re.sub(r"[^\"]\"[^\"].*[^\"]\"[^\"]",'"string"',input)

EDIT:

No, that won't work for the final example.

I don't think your requirements are regular: they can't be matched by a regular expression. This is because at the heart of the matter, you need to match any odd number of " grouped together, as that is your delimiter.

I think you'll have to do it manually, counting "s.
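One way the manual counting could look is sketched below. It assumes that a run of adjacent quotes with odd length flips between "inside" and "outside" a literal, while an even run leaves the state unchanged. The function name `replace_by_counting` is made up for this example.

```python
def replace_by_counting(source, placeholder='"string"'):
    """Replace quoted literals by counting runs of adjacent quotes (sketch)."""
    out = []
    inside = False
    i, n = 0, len(source)
    while i < n:
        if source[i] != '"':
            if not inside:
                out.append(source[i])   # copy code; drop string content
            i += 1
            continue
        # Measure the run of consecutive quote characters starting at i.
        j = i
        while j < n and source[j] == '"':
            j += 1
        run = j - i
        if run % 2 == 1:
            if inside:
                out.append(placeholder) # odd run while inside closes the literal
            inside = not inside         # odd run flips the state
        elif not inside:
            out.append(placeholder)     # even run outside is a whole literal, e.g. ""
        # An even run inside a literal is only escaped quotes, so it is dropped.
        i = j
    return "".join(out)


print(replace_by_counting('print("He said ""hi""")'))
```

An even run outside a literal is treated as a single literal here (so `""""` is one string containing a quote), which matches how the doubled-quote escape reads left to right.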

0

There's a very good string-matching regular expression over at ActiveState. If it doesn't work out of the box for your last example, it should be fairly trivial to add a repetition that groups adjacent quoted strings together.
