
I'm working on a semester mini project for my Compiler Construction course.

Right now I'm designing the scanner. It is written in Java and scans Java source code. The scanner will produce tokens that the parser will consume later.

Most of the work so far uses Java regular expressions. The problem I'm facing is that when I preprocess the code to remove inline and multi-line comments, the regex also removes comment-like text inside string literals. I'm using the following regex:

String regExPreProcess = "((?s)(/\\*.*?\\*/|/\\*.*))|(//.*)";

Could someone please shed some light on how to solve this? I've tried lookahead and lookbehind as well, but the issue persists.

  • I'm not even sure that's something a regex can do... Commented Nov 11, 2015 at 18:18
  • @Louis is right, regexes are no use for this. You can't just pluck out the bits that don't interest you, because you can't reliably identify them without knowing the whole context. Commented Nov 11, 2015 at 19:04
  • Are you sure that's what you want? What does it mean for a string literal to have a comment inside it? Why would you ever want that? Commented Nov 11, 2015 at 19:26
  • @mvd: That's the point: they're not comments. I believe he wants to remove all comments before he starts the "real" lexing, but he knows string literals may contain things that look like comments, and he wants to know how to ignore them. (Please correct me if I'm wrong, Umar.) Commented Nov 11, 2015 at 20:03
  • @Alan, yes, that's what I want to do. For example, given code like "This is string //not a comment" or "This is string /* not a comment */", the regex must not remove the text inside the strings that starts with comment symbols. Commented Nov 11, 2015 at 20:37

1 Answer


You first need to make a formal definition of inline and block (multi-line) comments.

Something like:

  • an inline comment starts with an inline comment delimiter (//) placed outside string literals and block comments, and ends at the end of the line
  • a string literal starts with a double quote (") placed outside inline and block comments, and ends with an unescaped double quote (")
  • an escaped double quote is a double quote preceded by an odd number of backslashes (\)
  • a block comment starts with a comment opening delimiter (/*) placed outside string literals and inline comments, and ends with a comment closing delimiter (*/)

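As you can see, these definitions depend on each other cyclically: whether // or /* opens a comment depends on whether you are inside a string literal, and whether " opens a string literal depends on whether you are inside a comment. Regular expressions are not suitable for this problem. You need to process the input text sequentially: detect a start token and ignore everything until the respective end token.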
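The sequential approach could be sketched roughly as a small state machine, as below. This is only a minimal illustration, not a complete Java lexer: the names `CommentStripper`, `strip`, and `State` are made up for this example, character literals are tracked as an extra assumption (so that `'"'` does not open a string), each block comment is replaced with a single space so that neighbouring tokens do not fuse, and Unicode escapes and text blocks are not handled.

```java
public class CommentStripper {
    // Hypothetical helper for illustration; not part of any library.
    private enum State { CODE, STRING, CHAR, LINE_COMMENT, BLOCK_COMMENT }

    /** Returns src with comments removed; text inside literals is kept verbatim. */
    public static String strip(String src) {
        StringBuilder out = new StringBuilder();
        State state = State.CODE;
        int i = 0;
        while (i < src.length()) {
            char c = src.charAt(i);
            char next = (i + 1 < src.length()) ? src.charAt(i + 1) : '\0';
            switch (state) {
                case CODE:
                    if (c == '/' && next == '/') {
                        state = State.LINE_COMMENT;
                        i += 2;
                    } else if (c == '/' && next == '*') {
                        state = State.BLOCK_COMMENT;
                        out.append(' '); // assumption: keep a separator in place of the comment
                        i += 2;
                    } else {
                        if (c == '"') state = State.STRING;
                        else if (c == '\'') state = State.CHAR;
                        out.append(c);
                        i++;
                    }
                    break;
                case STRING:
                case CHAR:
                    out.append(c);
                    if (c == '\\' && i + 1 < src.length()) {
                        // Copy the escaped character too, so \" or \' cannot end the literal.
                        out.append(next);
                        i += 2;
                    } else {
                        boolean closes = (state == State.STRING && c == '"')
                                || (state == State.CHAR && c == '\'');
                        if (closes) state = State.CODE;
                        i++;
                    }
                    break;
                case LINE_COMMENT:
                    if (c == '\n' || c == '\r') {
                        // The line terminator is not part of the comment; keep it.
                        state = State.CODE;
                        out.append(c);
                    }
                    i++;
                    break;
                case BLOCK_COMMENT:
                    if (c == '*' && next == '/') {
                        state = State.CODE;
                        i += 2;
                    } else {
                        i++; // an unterminated block comment runs to the end of input
                    }
                    break;
            }
        }
        return out.toString();
    }
}
```

Skipping the character after every backslash inside a literal is what implements the "odd number of backslashes" rule: backslashes are consumed in pairs, so a quote is treated as escaped only when an unpaired backslash precedes it. In a real scanner you would probably not strip comments in a separate pass at all, but recognise them in the same loop that produces the tokens.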
