1

I've been looking for a simple way to find quoted strings of text within a Java source code file. First, I looked to regular expressions. Then I realized I had two problems, because as this answer stated, there isn't going to be a totally correct regex for this, similar to the problems that arise with markup languages. The main issue comes from the fact that there may be escaped quotation marks within a string.

So, what options do I have for parsing a source code file to find strings (possibly with escaped quotations) within? Is there anything that already exists for doing this? Preferably, it would be in Python.

EDIT: Here's some oversimplified example code.

private static String[] b = {
    foo("HG@\"rND"),
    foo("K1\\"),
    bar("ab\\\\\\\"")
};

Any combination of backslashes should be handled. The desired output would be the strings themselves.

2
  • Your best bet would be to write a parser, using something like pyparsing. Commented Jan 24, 2014 at 6:14
  • Post an example string with your desired output, and I will try my best. Commented Jan 24, 2014 at 6:19

4 Answers

1

You can use something like this:

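import re

# An opening quote, then any run of escape pairs (\x) or characters that
# are neither a quote nor a backslash, then the closing quote. Consuming
# escapes in pairs keeps a literal such as "K1\\" from running on into
# the next string, which a lookbehind on a single backslash cannot do.
STRING_RE = re.compile(r'"(?:\\.|[^"\\])*"')

with open('input.java') as jfile:
    text = jfile.read()
for x in STRING_RE.findall(text):
    print(x)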

You will also need to strip comments first, which is not very difficult. Alternatively, look at a Java parser.
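One way to avoid a separate comment-stripping pass is to let a single pattern match comments and string literals together and keep only the strings. This is a rough sketch, not a full Java lexer: the names TOKEN_RE and find_strings are made up for illustration, and it assumes the source has no Unicode escapes standing in for quotes or slashes.

```python
import re

# Alternatives are tried left to right at each position:
#   //...        line comment
#   /* ... */    block comment
#   '...'        char literal, so that '"' does not open a string
#   "..."        string literal, captured in group 1
TOKEN_RE = re.compile(
    r"""//[^\n]*"""
    r"""|/\*.*?\*/"""
    r"""|'(?:\\.|[^'\\])*'"""
    r"""|("(?:\\.|[^"\\\n])*")""",
    re.DOTALL,
)


def find_strings(source):
    """Return string literals (quotes included) found outside comments."""
    return [
        m.group(1)
        for m in TOKEN_RE.finditer(source)
        if m.group(1) is not None
    ]
```

Because comments and char literals are consumed as whole tokens, a quote inside them never gets a chance to start a string match.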

1 Comment

Thanks for the parser link, that's great. Unfortunately I imagine that would be slower than glancing at a file and looking for strings only. There are a few hundred decompiled .java files that I'm looking through, and each one is quite large. Being fast and lightweight is key.
1

Detect the escaped-quote sequence \" and replace it with some other combination of characters. Extracting whatever is inside the quotes is then simple.

6 Comments

foo("K1\\") would fail under this condition
First replace each pair of backslashes (\\) with some placeholder string. After that, only single escape sequences are left, and you can then look for \".
Good call. It would be possible to use, for any arbitrary valid string, an invalid escape sequence of something like (quote) and (backslash) instead, run the regex, and replace those with the correct values.
I would always replace \\ with @~ and, once finished, replace it back with \\.
But there's no guarantee that specific string doesn't show up somewhere else. I'm dealing with a file that has all its strings encrypted via some odd XOR scheme, and I wouldn't be at all surprised if that showed up somewhere. Better to err on the side of caution with an invalid escape sequence.
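For what it's worth, the placeholder idea discussed in these comments could be sketched roughly as follows. The placeholders \q and \Q are an assumption picked for illustration (neither is a valid Java escape), and find_strings_with_placeholders is a hypothetical name. The sketch does not handle comments or char literals such as '"'.

```python
import re

# Hypothetical placeholders: \q and \Q are not valid Java escapes, so they
# should not occur in compilable source outside of comments.
ESCAPED_BACKSLASH = '\\q'
ESCAPED_QUOTE = '\\Q'


def find_strings_with_placeholders(source):
    """Mask escapes, match plain quoted runs, then restore the escapes."""
    # Order matters: pair up backslashes first, so that any \" left
    # afterwards really is an escaped quote.
    masked = source.replace('\\\\', ESCAPED_BACKSLASH)
    masked = masked.replace('\\"', ESCAPED_QUOTE)
    found = re.findall(r'"[^"\n]*"', masked)
    # Put the original escape sequences back into each match.
    return [
        s.replace(ESCAPED_QUOTE, '\\"').replace(ESCAPED_BACKSLASH, '\\\\')
        for s in found
    ]
```

The extra replace passes walk the whole file several times, so for large inputs a single regex or a state machine is probably the lighter option.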
1

What about writing a simple state machine? A simple example (with only double-quoted strings) could be:

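STATE_OUTSTRING = 0          # outside any string literal
STATE_INSTRING = 1           # inside a string literal
STATE_INSTRINGBACKSLASH = 2  # inside a string, right after a backslash

def getstrings(text):
    """Return the contents of all double-quoted strings found in text."""
    state = STATE_OUTSTRING
    strings = []
    curstring = None
    for c in text:
        if state == STATE_OUTSTRING:
            if c == '"':
                # Opening quote: start collecting a new string.
                state = STATE_INSTRING
                curstring = ""
        elif state == STATE_INSTRING:
            if c == '\\':
                # The backslash itself is dropped; the next character is
                # kept as-is, whatever it is.
                state = STATE_INSTRINGBACKSLASH
            elif c == '"':
                # Unescaped closing quote: the string is complete.
                state = STATE_OUTSTRING
                strings.append(curstring)
                curstring = None
            else:
                curstring += c
        else: # STATE_INSTRINGBACKSLASH
            curstring += c
            state = STATE_INSTRING
    return strings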

You could add states like STATE_INCOMMENT, for example, if needed.
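As a rough idea of what the comment states might look like, here is one possible extension. The state names and the function get_strings_skip_comments are made up for this sketch. Like the function above, it drops the backslash and keeps the character that follows it. Char literals such as '"' are not handled and would need two more states.

```python
(OUT, SLASH, LINE_COMMENT, BLOCK_COMMENT,
 BLOCK_STAR, IN_STRING, IN_STRING_ESC) = range(7)


def get_strings_skip_comments(text):
    """Collect string contents, ignoring // and /* */ comments."""
    state = OUT
    strings = []
    cur = []
    for c in text:
        if state in (OUT, SLASH):
            if state == SLASH and c == '/':
                state = LINE_COMMENT      # saw "//"
            elif state == SLASH and c == '*':
                state = BLOCK_COMMENT     # saw "/*"
            elif c == '/':
                state = SLASH             # possible start of a comment
            elif c == '"':
                state = IN_STRING
                cur = []
            else:
                state = OUT
        elif state == LINE_COMMENT:
            if c == '\n':
                state = OUT
        elif state == BLOCK_COMMENT:
            if c == '*':
                state = BLOCK_STAR        # possible start of "*/"
        elif state == BLOCK_STAR:
            if c == '/':
                state = OUT
            elif c != '*':
                state = BLOCK_COMMENT
        elif state == IN_STRING:
            if c == '\\':
                state = IN_STRING_ESC     # backslash itself is dropped
            elif c == '"':
                strings.append(''.join(cur))
                state = OUT
            else:
                cur.append(c)
        else:  # IN_STRING_ESC: keep the escaped character as-is
            cur.append(c)
            state = IN_STRING
    return strings
```

Collecting characters in a list and joining once per string avoids repeated string concatenation, which may matter on large files.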


0

Since this is a simple one, you're probably looking for something more advanced than

("(?:\\"|.)*")

Explanation: the \\" alternative consumes any escaped quote; otherwise the pattern matches any number of characters between two quotes.

Haven't tried the other answers, so there may already be a correct answer here, but anyway...

Regards

Edit: Fix for "flaw"??? Simply "eating" all escaped backslashes seems to do the trick:

("(?:\\"|\\\\|.)*?")

Edit again ;) :

Even better I think - "eat" all escaped characters:

("(?:\\.|.)*?")
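A quick way to try that last pattern from Python, run against the sample from the question. The variable names here are just for illustration.

```python
import re

# The pattern from the last edit above: an escape pair (\x) is consumed
# as a unit, so an escaped quote is not taken for the closing quote.
STRING_RE = re.compile(r'("(?:\\.|.)*?")')

# Raw string, so the backslashes reach the regex exactly as written.
java_source = r'''
private static String[] b = {
    foo("HG@\"rND"),
    foo("K1\\"),
    bar("ab\\\\\\\"")
}
'''

matches = STRING_RE.findall(java_source)
for literal in matches:
    print(literal)
```

Note that this does not know about comments or char literals, so a quote inside either of those would still start a match.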

1 Comment

There's a flaw in it... It won't handle escaped backslashes correctly. I.e. foo(bar("K1\\"),""); won't be parsed correctly. I'll get back if I find a solution.
