Regex comment-matching code in Java not working properly

Question

I have this Java code for identifying the comments in source code and printing them:

import java.util.Scanner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Solution {
    public static void main(String[] args) {
        Pattern pattern = Pattern.compile("(\\/\\*((.|\n)*)\\*\\/)|\\/\\/.*");
        String code = "";
        Scanner scan = new Scanner(System.in);
        while(scan.hasNext())
        {
            code+=(scan.nextLine()+"\n");

        }
        Matcher matcher = pattern.matcher(code);
        int nxtBrk=code.indexOf("\n");
        while(matcher.find())
        {

            int i=matcher.start(),j=matcher.end();
            if(nxtBrk<i)
            {
                System.out.print("\n");
            }
            System.out.print(code.substring(i,j));
            nxtBrk = code.indexOf("\n",j);
        }
        scan.close();
    }

}

When I run the code against this input:

 /*This is a program to calculate area of a circle after getting the radius as input from the user*/  
#include<stdio.h>
int main()  
{ //something

It correctly outputs only the comments. But when I give it this input:

 /*This is a program to calculate area of a circle after getting the radius as input from the user*/  
#include<stdio.h>
int main()  
{//ok
}  
/*A test run for the program was carried out and following output was observed  
If 50 is the radius of the circle whose area is to be calculated
The area of the circle is 7857.1429*/

The program outputs the whole code instead of just the comments. I don't understand why adding those last lines changes the result.

EDIT: A parser is not an option because I am solving challenge problems and I have to use a programming language. Link: https://www.hackerrank.com/challenges/ide-identifying-comments

Re "parser is not an option", not using a parser is not an option either unless you want to find spurious comments in "/* A string, not a comment */", "http://foo", "/path/*.txt" /* A file path */. You need to recognize all tokens that can contain comment boundaries to recognize comment boundaries correctly.
– Mike Samuel, Commented Jan 12, 2014 at 16:08
A parser most certainly is an option, especially as you only really need the lexer part (generally the simplest part if you've already got regular expression support available). Beware! This is quite a deep topic to get into properly; it was part of a second-year course back when I took CS (years ago…)
– Donal Fellows, Commented Jan 12, 2014 at 17:40
@Unbound, as Donal suggested, you can lex (tokenize) using a single regular expression and then filter out the matches that are not comment tokens. For example, Pattern.compile("(?:" + COMMENT_REGEX + ")|(?:" + STRING_REGEX + ")", ...) where STRING_REGEX = "\"(?:[^\"\\\\]|\\\\.)*\"|'(?:[^'\\\\]|\\\\.)*'". That way, quoted text will match as string tokens, which effectively hides any apparent comment boundaries inside those strings.
– Mike Samuel, Commented Jan 13, 2014 at 0:15
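
The tokenize-then-filter idea from the comment above might look roughly like the sketch below. STRING_REGEX is copied from the comment; everything else is an assumption made for illustration: the comment never defines COMMENT_REGEX, so the one here is a guess, and the class name CommentLexer, the method comments, and the named groups comment and string are hypothetical. It is not a full C lexer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CommentLexer {
    // Assumed definition: a block comment ending at the first "*/", or a line comment.
    // (?s:...) lets '.' match newlines inside the block comment only.
    static final String COMMENT_REGEX = "/\\*(?s:.*?)\\*/|//.*";
    // String and character literals, as quoted in the comment above.
    static final String STRING_REGEX =
            "\"(?:[^\"\\\\]|\\\\.)*\"|'(?:[^'\\\\]|\\\\.)*'";
    // One token pattern; the named groups record which alternative matched.
    static final Pattern TOKEN = Pattern.compile(
            "(?<comment>" + COMMENT_REGEX + ")|(?<string>" + STRING_REGEX + ")");

    // Returns only the comment tokens; string tokens are matched and then dropped.
    static List<String> comments(String code) {
        List<String> result = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(code);
        while (matcher.find()) {
            String comment = matcher.group("comment");
            if (comment != null) {
                result.add(comment);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String code = "char *s = \"/* not a comment */\"; // real comment\n"
                + "char *u = \"http://foo\"; /* also real */\n";
        System.out.println(comments(code));
        // prints [// real comment, /* also real */]
    }
}
```

Because the matcher consumes each string literal as a single token, the /* and // inside the quotes are never considered as comment starts.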

Sean Patrick Floyd · Accepted Answer · 2014-01-12 15:54:48Z

3

Parsing source code with regular expressions is very unreliable. I'd suggest you use a specialized parser. Creating one is pretty simple using ANTLR. And, since you seem to be parsing C source files, you can use the C grammar.

Donal Fellows · Accepted Answer · 2014-01-12 15:59:18Z

2

Your pattern, shorn of its Java quoting (and some unnecessary backslashes), is this (the line break inside it is the literal newline that \n stands for):

(/\*((.|
)*)\*/)|//.*

That's fine enough, except that it uses only greedy quantifiers, which means that it will match from the first /* to the last */. You want non-greedy quantifiers instead, to get this pattern:

(/\*((.|
)*?)\*/)|//.*

Small change, big consequence, since it now matches only up to the first */ after the /*. Re-encoded as Java code:

Pattern pattern = Pattern.compile("(/\\*((.|\n)*?)\\*/)|//.*");
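
To see the difference, here is a minimal sketch that runs both patterns over a shortened version of the second input from the question. The class name GreedyVsReluctant and the helper findAll are made up for this illustration; the two patterns are the ones quoted above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GreedyVsReluctant {
    // Pattern from the question: the greedy (.|\n)* runs on to the LAST "*/".
    static final Pattern GREEDY = Pattern.compile("(/\\*((.|\n)*)\\*/)|//.*");
    // Corrected pattern: the reluctant (.|\n)*? stops at the FIRST "*/".
    static final Pattern RELUCTANT = Pattern.compile("(/\\*((.|\n)*?)\\*/)|//.*");

    // Collects every match of the pattern in the given text, in order.
    static List<String> findAll(Pattern pattern, String code) {
        List<String> matches = new ArrayList<>();
        Matcher matcher = pattern.matcher(code);
        while (matcher.find()) {
            matches.add(matcher.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String code = "/*This is a program*/\n#include<stdio.h>\nint main()\n{//ok\n}\n"
                + "/*A test run\nThe area is 7857.1429*/\n";
        // One match that swallows everything between the first "/*" and the last "*/".
        System.out.println(findAll(GREEDY, code).size());    // prints 1
        // Three separate matches: two block comments and the "//ok" line comment.
        System.out.println(findAll(RELUCTANT, code).size()); // prints 3
    }
}
```

One caveat, offered as an aside rather than part of the fix: on long inputs a repeated group such as (.|\n)*? can be expensive in java.util.regex, so writing the block-comment part as /\*(?s:.*?)\*/ may be a more robust way to express the same thing.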

(Be aware that you are very close to the limit of what it is sensible to match with regular expressions. Indeed, it's actually incorrect since you might have strings with /* or // in. But you'll probably get away with it…)
