Problems with groups in Java regex

Question

I'm quite sure that this has a simple solution, but I've been searching for three hours and haven't managed to find anything that helps me.

I'm writing a parser in Java using regex and I'm supposed to be able to match some previously decided words, numbers from 1-10000 and hex color codes. Now it's going great matching the words, but the reader isn't reading the numbers and color codes as a whole. For example it reads the input:

DOWN. COLOR #000000.

as:

Reading: DOWN Returning: Down

Reading: . Returning: Dot

Reading: Returning: Whitespace

Reading: COLOR Returning: Color

Reading: Returning: Whitespace

Reading: # Returning: nothing

Reading: 0 Returning: Number

Reading: A Returning: nothing

Reading: F Returning: nothing

Reading: 2 Returning: Number

Reading: 3 Returning: Number

Reading: 4 Returning: Number

Reading: . Returning: Dot

So it's able to read the words COLOR and DOWN as a whole as I want but it doesn't read the color code #000000. I would ideally want those seven lines to be:

Reading: #0AF234 Returning: Colorcode

I have:

String stringTokens = "DOWN|COLOR|(\\s|\\t)+|\\n|\b[1-9][0-9]{0,3}\b|10000|^(#)([a-fA-F0-9]{6})$";
Pattern stringPattern = Pattern.compile(stringTokens, Pattern.CASE_INSENSITIVE);
Matcher m = stringPattern.matcher(input);

Then:

while (m.find()) {
    if (m.start() != inputPos) {
        tokens.add(new Token(lineNo, TokenType.Invalid));
    }
    if (m.group().matches("^(#)([a-fA-F0-9]{6})$"))
        tokens.add(new Token(lineNo, TokenType.ColorCode));
    else if (m.group().equals("."))
        tokens.add(new Token(lineNo, TokenType.Dot));
    else if (m.group().matches("DOWN"))
        tokens.add(new Token(lineNo, TokenType.Down));
    else if (m.group().matches("COLOR"))
        tokens.add(new Token(lineNo, TokenType.Color));
    else if (Character.isDigit(m.group().charAt(0)))
        tokens.add(new Token(lineNo, TokenType.Number, Integer.parseInt(m.group())));
    else if (m.group().matches("\\n")) {
        tokens.add(new Token(lineNo, TokenType.Whitespace));
        lineNo++;
    }
    else if (m.group().matches("(\\s|\\t)+"))
        tokens.add(new Token(lineNo, TokenType.Whitespace));
    inputPos = m.end();
}

So my question is basically:

How do I read the color codes and the numbers as whole tokens? When I print out m.group() for each match now, it only returns single digits. Yet I was looking at other code where the digits are read in the same way, except that the number part of the regex is simply [0-9]+, which is too simple for my needs. There, each number was read as a whole.

I have tried to use something along the lines of m.group(1) and m.group(2), used the word boundaries (which I don't understand completely) and the ^$ format, but nothing seems to work to read the token as a whole.

I hope I managed to keep the code I copied simple without missing anything important, and that someone can help me figure this simple (it must be?!) thing out. Thank you! :)

Do all lines have a particular format, e.g. UP/DOWN, then COLOR, then a hex code? If so, your life would be easier if you parsed the whole line instead of bits of it. Let me know.
– Bohemian ♦, Commented May 11, 2015 at 16:11
@Bohemian They have no particular format but a COLOR has to be followed by 1+ spaces, then color code, 0+ spaces and a dot. A syntax error is thrown if this is not the order, and I'm doing that in the parser, but this is just the lexer I'm trying to get to recognize a valid input. I have to work more with the hex code later so I'm not sure what is best to do with it in the lexer. Now it's recognizing the exact words, like UP and DOWN, but not uP, down, and hex codes. It only validates what EQUALS, but never matches, except for in the case of whitespaces. Thank you so much for your help! :)
– Helga, Commented May 13, 2015 at 16:59

jsantander · Accepted Answer · 2015-05-11 16:11:45Z

1

So you have a regexp:

DOWN|COLOR|(\\s|\\t)+|\\n|\b[1-9][0-9]{0,3}\b|10000|^(#)([a-fA-F0-9]{6})$

That we can decompose as:

DOWN
COLOR
(\\s|\\t)+: one or more \s (OK, this is a whitespace class) or \t (not really needed as \t is included in \s)
\\n (note this is also included in the \s)
\b[1-9][0-9]{0,3}\b: OK, here you try to use a word boundary, but you are not taking into account that backslashes need to be escaped in a Java string, so it should be \\b. As written, \b is the backspace character. Not sure why you would want to use a word boundary here?
10000: this one is needed, since the previous pattern only matches 1 to 9999.
^(#)([a-fA-F0-9]{6})$: The (#) seems unnecessary, just use #. With the ^...$ anchors you're forcing the color code (e.g. #abcdef) to be the entire content of the input, so I'd remove them.
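The escaping point above can be checked in isolation. This is only a minimal sketch: the class name WordBoundaryDemo, the helper finds() and the sample input are made up for illustration and are not part of the question's code.

```java
import java.util.regex.Pattern;

class WordBoundaryDemo {
    // Returns true if the regex finds a match anywhere in the input.
    static boolean finds(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        // "\b" in a Java string literal is the backspace character (U+0008),
        // so this pattern looks for backspace, digits, backspace.
        String unescaped = "\b[1-9][0-9]{0,3}\b";
        // "\\b" reaches the regex engine as \b, the word-boundary assertion.
        String escaped = "\\b[1-9][0-9]{0,3}\\b";

        System.out.println(finds(unescaped, "LEFT 42.")); // false
        System.out.println(finds(escaped, "LEFT 42."));   // true
    }
}
```

The input contains no backspace characters, so the first pattern cannot match any number in it.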

How do you match the dot?

Since you need to match again to distinguish the different types of tokens, why not use multiple regexps, one for each token type (or no regexp at all for the literals), and check each of them against the head of the string being parsed?

If it matches you have a new token and you can consume the matched part of the string.
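One way to read this suggestion is sketched below. It is an assumption-laden illustration, not the asker's lexer: the class HeadLexer, the rule names and the "Type:text" string output are hypothetical stand-ins for the Token and TokenType classes, which the question does not show. It uses Matcher.region() and Matcher.lookingAt() so that each pattern is only tried at the current position.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HeadLexer {
    // One pattern per token type, tried in insertion order (hypothetical names).
    private static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("Down", Pattern.compile("DOWN", Pattern.CASE_INSENSITIVE));
        RULES.put("Color", Pattern.compile("COLOR", Pattern.CASE_INSENSITIVE));
        RULES.put("ColorCode", Pattern.compile("#[0-9a-fA-F]{6}"));
        RULES.put("Number", Pattern.compile("(?:10000|[1-9][0-9]{0,3})\\b"));
        RULES.put("Dot", Pattern.compile("\\."));
        RULES.put("Whitespace", Pattern.compile("\\s+"));
    }

    // Returns tokens as "Type:text" strings; unmatched characters become "Invalid".
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            boolean matched = false;
            for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
                Matcher m = rule.getValue().matcher(input);
                m.region(pos, input.length());
                // lookingAt() only succeeds if the match starts at the region start.
                if (m.lookingAt()) {
                    tokens.add(rule.getKey() + ":" + m.group());
                    pos = m.end(); // consume the matched part
                    matched = true;
                    break;
                }
            }
            if (!matched) {
                tokens.add("Invalid:" + input.charAt(pos));
                pos++;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("DOWN. COLOR #0AF234."));
    }
}
```

With this layout the token type is known from the rule that matched, so there is no need to re-match m.group() afterwards. The sketch folds newlines into the Whitespace rule; a separate rule for \n placed before it would be needed to keep a line count.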

answered May 11, 2015 at 16:11

jsantander


1 Comment

Helga Over a year ago

Thank you! It turned out that I was matching the dot in the wrong way. I was doing it but forgot to copy that part to the question, which is a mistake that cost me some time! It threw off the hex code and number matching but everything is working now. I do a \\n separately since I'm keeping a count of the lines. I was doing: m.group().equals(".") - and it always read a dot correctly so I didn't give it more thought. Now I do: m.group().matches("\\."). Something that I should have realized earlier but the error was in another place!

BillRobertson42 · Accepted Answer · 2015-05-11 16:03:59Z

0

Do you need the start-of-line and end-of-line anchors (^ and $) around the hex-number part of the regex?

For example, in Clojure, which just uses java.util.regex:

With the ^ and $ in place (the original regex), the hex color can't be tokenized:

user=> (def r #"DOWN|COLOR|(\\s|\\t)+|\\n|\b[1-9][0-9]{0,3}\b|10000|^(#)([a-fA-F0-9]{6})$")
#'user/r
user=> (re-seq r "DOWN. COLOR #000000.")
(["DOWN" nil nil nil] ["COLOR" nil nil nil])

But without the begin/end line tokens, it can.

user=> (def r #"DOWN|COLOR|(\\s|\\t)+|\\n|\b[1-9][0-9]{0,3}\b|10000|(#)([a-fA-F0-9]{6})")
#'user/r
user=> (re-seq r "DOWN. COLOR #000000.")
(["DOWN" nil nil nil] ["COLOR" nil nil nil] ["#000000" nil "#" "000000"])
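The same comparison can be made directly in Java. This is a minimal sketch that isolates only the color-code alternative; the class name AnchorDemo and the helper findAll() are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AnchorDemo {
    // Collects every substring that find() reports for the given regex.
    static List<String> findAll(String regex, String input) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String input = "DOWN. COLOR #000000.";
        // Anchored: the code would have to be the entire input.
        System.out.println(findAll("^#[a-fA-F0-9]{6}$", input)); // []
        // Unanchored: the code is found in the middle of the input.
        System.out.println(findAll("#[a-fA-F0-9]{6}", input));   // [#000000]
    }
}
```

Without the MULTILINE flag, ^ and $ refer to the start and end of the whole input, which is why the anchored alternative can never match a color code that sits inside a longer line.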

answered May 11, 2015 at 16:03

BillRobertson42


Bohemian · Accepted Answer · 2015-05-14 01:45:28Z

0

You seem to want to assert that if COLOR is present, then the color should be correctly formatted. Also, to make life easier, just convert the line to uppercase before working with it. You can do it without a lot of code and without a Matcher if you use replaceAll() judiciously:

input = input.toUpperCase();
if (input.matches(".*\\bCOLOR\\b.*")) {
    String color = input.replaceAll(".*\\bCOLOR\\b +(?:(#[A-F0-9]{6})\\.)?.*", "$1");
    if (color.isEmpty()) {
        // incorrect syntax
    }
}

The replaceAll() optionally captures the color code, but matches the whole input regardless, so if the input doesn't conform, group 1 captures nothing - returning a blank. If it conforms, you get the color code.
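To try this out, the snippet can be wrapped in a small method. The following is a sketch under assumptions: the class ColorCheck, the method extractColor() and its null return for lines without COLOR are illustrative choices, not something the answer specifies.

```java
class ColorCheck {
    // Returns the color code that follows COLOR, "" if COLOR is present but
    // no well-formed "#RRGGBB." follows the spaces, or null if the line has
    // no COLOR keyword at all (the null convention is an assumption).
    static String extractColor(String line) {
        String input = line.toUpperCase();
        if (!input.matches(".*\\bCOLOR\\b.*")) {
            return null;
        }
        // Group 1 is optional; when it does not participate, "$1" is empty.
        return input.replaceAll(".*\\bCOLOR\\b +(?:(#[A-F0-9]{6})\\.)?.*", "$1");
    }

    public static void main(String[] args) {
        System.out.println(extractColor("down. color #0af234.")); // #0AF234
        System.out.println(extractColor("COLOR red.").isEmpty()); // true
        System.out.println(extractColor("DOWN."));                // null
    }
}
```

One caveat: if COLOR is not followed by at least one space, the replacement pattern does not match at all and replaceAll() returns the line unchanged, so a caller may want to also check the result against #[A-F0-9]{6} before trusting it.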

answered May 14, 2015 at 1:45

Bohemian♦
