
I would like to create a regex so that I can split a string in Java with the following constraints:

The string should be split on any non-word character, except for:
 (a) Characters surrounded by ' '
 (b) Any instance of    :=   >=   <=   <>   ..

So that for the following sample string:

print('*');  x := x - 100

I can get the following result in a String[]:

print
(
'*'
)
;

x

:=

x

-

100

This is the regex I have so far:

str.split("\\s+|"+
          "(?=[^\\w'][^']*('[^']*'[^']*)*$)|" +
          "(?<=[^\\w'])(?=[^']*('[^']*'[^']*)*$)|" +
          "(?=('[^']*'[^']*)*$)|" +
          "(?<=')(?=[^']*('[^']*'[^']*)*$)");

But this gives me the following result:

print
(
'*'
)
;

x

:
=    <-- This is the problem: it should be on the line above, next to the :

x

-

100

UPDATE

I have now learned that it's not possible to achieve this using Regex.

However, I still cannot use any external libraries, frameworks, or lexers, and have to use built-in Java classes, such as StringTokenizer.

  • You cannot do (a) with a regular expression, period. A language with matched delimiter pairs is not a regular language. You need to write/use a proper lexer. Commented Sep 24, 2016 at 21:35
  • Can't he use lookbehind and lookahead in some way? Commented Sep 24, 2016 at 21:36
  • @OrangeDog But it works well with the current regex, though only for one of the two constraints. Is it not possible to add additional regex for constraint (b)? Commented Sep 24, 2016 at 21:38
  • @Gus No. For the same reason you cannot parse html with a regular expression. Commented Sep 24, 2016 at 21:38
  • @DarkKnight No, it doesn't work well. It just happens to work for your specific example, but it will quickly break down with a more complicated structure of nested quotes. Commented Sep 24, 2016 at 21:39

1 Answer

Disclaimer: Regex is not a generic parser. If the text you're reading is a complex language, with nested constructs, then you need to use an actual lexer, not a regex. E.g. the code below supports "Characters surrounded by ' '", which is a simple definition, but if the characters can contain escaped ' characters, you'll need a lexer.

Don't use split().

Your code will be much easier to read and understand if you use a find() loop. It'll also perform better.

You write your regex to specify what you want to capture in one iteration of the find() loop. You can rely on | to choose the first pattern that matches, so put more specific patterns first.
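As a small illustration of why the order of alternatives matters, here is a minimal sketch. The class name AlternationOrderDemo and the helper firstToken are made up for this example and are not part of the original answer. Java tries alternatives left to right and takes the first one that matches at the current position:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, for illustration only.
class AlternationOrderDemo {

    // Returns the first match of regex in input, or null if there is none.
    static String firstToken(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Specific pattern first: the two-character operator wins.
        System.out.println(firstToken("[:><]=|.", ":= x")); // prints :=
        // Catch-all first: "." matches ':' alone, so ":=" is split apart.
        System.out.println(firstToken(".|[:><]=", ":= x")); // prints :
    }
}
```

This is why the catch-all "." belongs at the very end of the pattern.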

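import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Alternatives are tried left to right, so the catch-all "." comes last.
Pattern p = Pattern.compile("\\s+" +    // sequence of whitespace
                           "|\\w+" +    // sequence of word characters
                           "|'[^']*'" + // Characters surrounded by ' '
                           "|[:><]=" +  // :=   >=   <=
                           "|<>" +      // <>
                           "|\\.\\." +  // ..
                           "|.");       // Any single other character
String input = "print('*');  x := x - 100";
// Each find() call consumes exactly one token, including whitespace runs.
for (Matcher m = p.matcher(input); m.find(); )
    System.out.println(m.group());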

Output

print
(
'*'
)
;

x

:=

x

-

100
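
If you need the tokens in a String[] as the question asks, and the whitespace tokens are not wanted, the same find() loop can collect matches into a list. The following is a minimal sketch under those assumptions. The class name TokenizerSketch and the method tokenize are hypothetical, and dropping whitespace is a choice made for this example (the output above keeps it):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical wrapper around the pattern from the answer, for illustration only.
class TokenizerSketch {

    private static final Pattern TOKEN = Pattern.compile(
            "\\s+"          // sequence of whitespace (dropped below)
          + "|\\w+"         // sequence of word characters
          + "|'[^']*'"      // characters surrounded by ' '
          + "|[:><]="       // :=   >=   <=
          + "|<>"           // <>
          + "|\\.\\."       // ..
          + "|.");          // any single other character

    // Splits input into tokens, skipping tokens that are only whitespace.
    static String[] tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            String token = m.group();
            if (!token.trim().isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens.toArray(new String[0]);
    }

    public static void main(String[] args) {
        for (String token : tokenize("print('*');  x := x - 100")) {
            System.out.println(token);
        }
    }
}
```

Only java.util and java.util.regex are used, so this stays within the "no external libraries" constraint from the question. As with the code above, it does not handle escaped ' characters inside quotes.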