Regex matching Java

Question

I'm having trouble with a regex that matches an upper case letter optionally followed by a lower case letter. I want to split the string around every such match, but I just can't seem to get it to work.

To put it more generally: I want to split before and after every match of the regex.

Example string "TeSTString"

Wanted result -> [Te, S, T, St, ring]

I have tried everything I can think of, but look-ahead and look-behind keep tripping me up.

First I tried [A-Z][a-z]?, which matches perfectly, but split removes the matched text...

result -> [ring]

After this I tried a positive look-ahead, (?=([A-Z][a-z]?)), which gave me something close...

result -> [Te, S, T, String]

Then a look-behind, (<=?([A-Z][a-z]?)), which did nothing at all...

result -> [TeSTString]

I even tried reversing the look-behind, (<=?([a-z]?[A-Z])), in a desperate attempt, but that was just as unsuccessful.

Can anyone give a good pointer in the right direction before I lose my mind?
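For reference, the three attempts described above can be reproduced with a short sketch. The class and method names here are made up for illustration, and the outputs in the comments assume Java 8 or later. One thing worth noticing: the look-behind as typed, (<=?...), is not look-behind syntax. Java reads it as a literal < followed by an optional =, so it can never match this input; the look-behind form is (?<=...).

```java
import java.util.Arrays;

class SplitAttempts {
    // Attempt 1: split consumes the matched text, leaving empty strings behind.
    static String consuming(String s) {
        return Arrays.toString(s.split("[A-Z][a-z]?"));
    }

    // Attempt 2: zero-width look-ahead, splits before each upper case letter only.
    static String lookAhead(String s) {
        return Arrays.toString(s.split("(?=([A-Z][a-z]?))"));
    }

    // Attempt 3: "(<=?" is a group starting with a literal '<', not a look-behind.
    static String malformedLookBehind(String s) {
        return Arrays.toString(s.split("(<=?([A-Z][a-z]?))"));
    }

    public static void main(String[] args) {
        String s = "TeSTString";
        System.out.println(consuming(s));           // [, , , , ring]
        System.out.println(lookAhead(s));           // [Te, S, T, String]
        System.out.println(malformedLookBehind(s)); // [TeSTString]
    }
}
```

The first result contains four empty strings before "ring", which is easy to misread as just [ring] when printed.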

Don't use split, you're going to have to do a regex search and manually build the array (or whatever) yourself.
– Necreaux, Commented Mar 10, 2016 at 15:34
@sp00m because that is the goal of my split? I'm sure I understand the question. I want to split off any ([A-Z][a-z]?), because I'm trying to make a custom simple tokenizer and parser.
– Eric, Commented Mar 10, 2016 at 15:37
@Necreaux I can see that working. Can't believe I didn't think of that. Will try that out if I can't find a way with split. Thanks!
– Eric, Commented Mar 10, 2016 at 15:39
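Necreaux's suggestion (search with the regex and build the list by hand) could look roughly like the sketch below. It is only one way to do it: the class name ManualTokenizer and the method tokenize are hypothetical, and it assumes the text between matches should be kept as its own token.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ManualTokenizer {
    // An upper case letter optionally followed by one lower case letter.
    private static final Pattern TOKEN = Pattern.compile("[A-Z][a-z]?");

    static List<String> tokenize(String input) {
        List<String> parts = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        int last = 0; // end of the previous match
        while (m.find()) {
            if (m.start() > last) {
                // Keep any unmatched text between two matches.
                parts.add(input.substring(last, m.start()));
            }
            parts.add(m.group());
            last = m.end();
        }
        if (last < input.length()) {
            // Keep unmatched text after the final match.
            parts.add(input.substring(last));
        }
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("TeSTString")); // [Te, S, T, St, ring]
    }
}
```

Compared with split, this keeps the original simple pattern and needs no lookarounds, at the cost of a few more lines.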

Mena · Accepted Answer · 2016-03-10 15:53:11Z

1

Here's one convoluted pattern that will produce the expected result.

// Requires: import java.util.Arrays;
String test = "TeSTStringOne";
System.out.println(
    Arrays.toString(
        //          | preceded by lowercase
        //          |        | followed by uppercase
        //          |        |       | or
        //          |        |       || preceded and followed by uppercase
        //          |        |       ||                  | or
        //          |        |       ||                  || preceded by uc
        //          |        |       ||                  || AND lowercase
        test.split("(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z])|(?<=[A-Z][a-z])")
    )
);

Output

[Te, S, T, St, ring, On, e]

Note

Replace [a-z] with \\p{Ll} and [A-Z] with \\p{Lu} (as written in a Java string literal) to handle accented and other non-ASCII letters.
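As a rough illustration of that note, the sketch below applies both the ASCII pattern and its \\p{Lu} / \\p{Ll} counterpart to an accented input. The class name, constant names, and test string are made up for the example; the accented letter is written as a Unicode escape so the source stays ASCII-only.

```java
import java.util.Arrays;

class UnicodeCaseSplit {
    // The pattern from the answer, restricted to ASCII letters.
    static final String ASCII_SPLIT =
        "(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z])|(?<=[A-Z][a-z])";

    // Same three alternatives, using Unicode upper/lower case letter categories.
    static final String UNICODE_SPLIT =
        "(?<=\\p{Ll})(?=\\p{Lu})|(?<=\\p{Lu})(?=\\p{Lu})|(?<=\\p{Lu}\\p{Ll})";

    static String[] splitAscii(String s) {
        return s.split(ASCII_SPLIT);
    }

    static String[] splitUnicode(String s) {
        return s.split(UNICODE_SPLIT);
    }

    public static void main(String[] args) {
        // U+00C9 is an upper case E with an acute accent.
        String input = "\u00C9t\u00C9cole";
        System.out.println(Arrays.toString(splitAscii(input)));   // one element, no split
        System.out.println(Arrays.toString(splitUnicode(input))); // three elements
    }
}
```

With this input the ASCII version returns the string unsplit, while the Unicode version yields Ét, Éc, ole.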

edited Mar 10, 2016 at 15:53

answered Mar 10, 2016 at 15:35

Mena


5 Comments

Eric Over a year ago

This works perfect. Bonus with the explanation! Great many thanks to you Mena!

Mena Over a year ago

@Eric glad it works for you. I just removed the 1st part of the pattern by the way - it seemed redundant and yields the same output with the given test string.

Eric Over a year ago

That part was not redundant: TeSTStringOnE now gives Te S T St ringOn E

sp00m Over a year ago

Which tool have you used to generate the explanation? Or have you done it manually?

Mena Over a year ago

@sp00m manually :( it's a pain but I don't like the automatic tools too much
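Eric's counterexample from the comments above can be checked with a small sketch that compares the full three-part pattern with a version lacking the first alternative, (?<=[a-z])(?=[A-Z]). The class and constant names are hypothetical. Without that alternative, nothing splits between a lower case letter and a following upper case letter unless the lower case letter itself follows an upper case one.

```java
import java.util.Arrays;

class FirstAlternativeCheck {
    // Full pattern as posted in the answer.
    static final String FULL =
        "(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z])|(?<=[A-Z][a-z])";

    // Same pattern with the first alternative removed.
    static final String WITHOUT_FIRST =
        "(?<=[A-Z])(?=[A-Z])|(?<=[A-Z][a-z])";

    static String split(String input, String regex) {
        return Arrays.toString(input.split(regex));
    }

    public static void main(String[] args) {
        String input = "TeSTStringOnE";
        System.out.println(split(input, FULL));          // [Te, S, T, St, ring, On, E]
        System.out.println(split(input, WITHOUT_FIRST)); // [Te, S, T, St, ringOn, E]
    }
}
```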

m.cekiera · Accepted Answer · 2016-03-10 15:42:14Z

0

Try with:

(?<=[A-Z][a-z])|(?=(?<!^)[A-Z])


(?<=[A-Z][a-z]) - positive lookbehind for an upper case letter followed by a lower case letter,
(?=(?<!^)[A-Z]) - positive lookahead for an upper case letter, unless it sits at the beginning of the line.
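A minimal sketch of how the pattern above might be used with String.split in Java follows. The class and method names are assumptions for illustration, and the backslash-free pattern needs no extra escaping in a Java string literal.

```java
import java.util.Arrays;

class LookaroundSplit {
    // Split after "upper + lower", or before any upper case letter not at the start.
    static final String SPLIT_POINTS = "(?<=[A-Z][a-z])|(?=(?<!^)[A-Z])";

    static String[] tokens(String input) {
        return input.split(SPLIT_POINTS);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(tokens("TeSTString")));    // [Te, S, T, St, ring]
        System.out.println(Arrays.toString(tokens("TeSTStringOnE"))); // [Te, S, T, St, ring, On, E]
    }
}
```

Both alternatives are zero-width, so no characters are consumed and every piece of the input appears in the result.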

answered Mar 10, 2016 at 15:42

m.cekiera


1 Comment

Eric Over a year ago

This seems to work as well! You guys are really fast and good at this!
