I have a regex problem: I want to match an upper case letter possibly followed by a lower case letter, and split after every such match, but I just can't seem to get it to work.

More generally, I want to split before and after every regex match.

Example string "TeSTString"

Wanted result -> [Te, S, T, St, ring]

I have tried everything I can think of, but look-ahead and look-behind keep tripping me up.

First I tried [A-Z][a-z]?, which matches perfectly, but split removes the matched text...

result -> [ring]

After this I tried a positive look-ahead, (?=([A-Z][a-z]?)), which gave me something close...

result -> [Te, S, T, String]

and a look-behind, (<=?([A-Z][a-z]?)), which gave nothing at all...

result -> [TeSTString]

I even tried reversing the look-behind, (<=?([a-z]?[A-Z])), in a desperate attempt, but that did not work either.

Can anyone give a good pointer in the right direction before I lose my mind?

  • Why should String be split into St and ring? Commented Mar 10, 2016 at 15:32
  • Don't use split, you're going to have to do a regex search and manually build the array (or whatever) yourself. Commented Mar 10, 2016 at 15:34
  • @sp00m because that is the goal of my split? I'm not sure I understand the question. I want to split off any ([A-Z][a-z]?), because I'm trying to make a simple custom tokenizer and parser. Commented Mar 10, 2016 at 15:37
  • @Necreaux I can see that working. Can't believe I didn't think of that. Will try that out if I can't find a way with split. Thanks! Commented Mar 10, 2016 at 15:39
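
The approach suggested in the comments above (run a regex search and build the list manually instead of calling split) could look roughly like the sketch below. The class and method names (MatchTokenizer, tokenize) are made up for illustration; only Pattern and Matcher come from java.util.regex.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original question.
final class MatchTokenizer {
    // Token from the question: one upper case letter, optionally followed by one lower case letter.
    private static final Pattern TOKEN = Pattern.compile("[A-Z][a-z]?");

    static List<String> tokenize(String input) {
        List<String> parts = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        int last = 0; // end of the previous match
        while (m.find()) {
            // Keep any unmatched text between the previous match and this one.
            if (m.start() > last) {
                parts.add(input.substring(last, m.start()));
            }
            parts.add(m.group());
            last = m.end();
        }
        // Keep whatever is left after the final match.
        if (last < input.length()) {
            parts.add(input.substring(last));
        }
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("TeSTString")); // [Te, S, T, St, ring]
    }
}
```

This keeps both the matched tokens and the text between them, which is the "split before and after any match" behaviour the question asks for.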

2 Answers

Here's one somewhat convoluted pattern that produces the expected result when passed to split.

import java.util.Arrays;

String test = "TeSTStringOne";
System.out.println(
    Arrays.toString(
        //          | preceded by lowercase
        //          |        | followed by uppercase
        //          |        |       | or
        //          |        |       || preceded and followed by uppercase
        //          |        |       ||                  | or
        //          |        |       ||                  || preceded by uc
        //          |        |       ||                  || AND lowercase
        test.split("(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z])|(?<=[A-Z][a-z])")
    )
);

Output

[Te, S, T, St, ring, On, e]

Note

In the Java string literal, replace [a-z] with \\p{Ll} and [A-Z] with \\p{Lu} to handle accented letters as well.
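
As a rough sketch of that substitution, assuming the same three alternatives as the pattern above: the class name UnicodeCaseSplit and the sample input are made up for illustration, and the accented letters are written as Unicode escapes so the source compiles regardless of file encoding.

```java
import java.util.Arrays;

// Hypothetical demo class, not part of the original answer.
final class UnicodeCaseSplit {
    // Same structure as the ASCII pattern, with \p{Ll} for [a-z] and \p{Lu} for [A-Z].
    static final String BOUNDARY =
        "(?<=\\p{Ll})(?=\\p{Lu})|(?<=\\p{Lu})(?=\\p{Lu})|(?<=\\p{Lu}\\p{Ll})";

    static String[] split(String input) {
        return input.split(BOUNDARY);
    }

    public static void main(String[] args) {
        // \u00C9 is an upper case E with acute accent, \u00C0 an upper case A with grave accent.
        String sample = "\u00C9cole\u00C0Paris";
        System.out.println(Arrays.toString(split(sample)));
    }
}
```

With the ASCII-only character classes, the accented upper case letters in this sample would not be recognised as upper case at all, so most of the split points would be missed.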

5 Comments

This works perfectly. Bonus points for the explanation! Many thanks, Mena!
@Eric glad it works for you. I just removed the 1st part of the pattern by the way - it seemed redundant and yields the same output with the given test string.
The part was not redundant, TeSTStringOnE now gives Te S T St ringOn E
Which tool have you used to generate the explanation? Or have you done it manually?
@sp00m manually :( it's a pain but I don't like the automatic tools too much
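
The disagreement in the comments about the first alternative can be checked with a small comparison like the one below. This is only a sketch: the class name FirstAlternativeCheck and the constant names are invented here, and SHORTENED is simply the three-part pattern with its first alternative dropped.

```java
import java.util.Arrays;

// Hypothetical demo class, not part of the original answer.
final class FirstAlternativeCheck {
    // Three-part pattern as shown in the answer.
    static final String FULL =
        "(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z])|(?<=[A-Z][a-z])";
    // Same pattern without the "lower case followed by upper case" alternative.
    static final String SHORTENED =
        "(?<=[A-Z])(?=[A-Z])|(?<=[A-Z][a-z])";

    public static void main(String[] args) {
        String test = "TeSTStringOnE";
        System.out.println(Arrays.toString(test.split(FULL)));      // [Te, S, T, St, ring, On, E]
        System.out.println(Arrays.toString(test.split(SHORTENED))); // [Te, S, T, St, ringOn, E]
    }
}
```

Without the first alternative nothing splits between the g and the O, because that position is neither between two upper case letters nor directly after an upper case and lower case pair.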

Try with:

(?<=[A-Z][a-z])|(?=(?<!^)[A-Z])

  • (?<=[A-Z][a-z]) = positive lookbehind for an upper case letter followed by a lower case letter,
  • (?=(?<!^)[A-Z]) = positive lookahead for an upper case letter, unless it is at the beginning of the line.
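
A minimal sketch of using this pattern with String.split, assuming plain ASCII input (the class name LookaroundSplit and the split helper are made up for illustration):

```java
import java.util.Arrays;

// Hypothetical demo class, not part of the original answer.
final class LookaroundSplit {
    // Split after "upper case + lower case", or before any upper case letter
    // that is not at the start of the input.
    static final String BOUNDARY = "(?<=[A-Z][a-z])|(?=(?<!^)[A-Z])";

    static String[] split(String input) {
        return input.split(BOUNDARY);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(split("TeSTString")));    // [Te, S, T, St, ring]
        System.out.println(Arrays.toString(split("TeSTStringOnE"))); // [Te, S, T, St, ring, On, E]
    }
}
```

The (?<!^) guard is what prevents a zero-width match at the very start of the input.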

1 Comment

This seems to work as well! You guys are really fast and good at this!
