12

String to be split

abc:def:ghi\:klm:nop

The string should be split on ":", and "\" is the escape character, so "\:" should not be treated as a delimiter.

split(":") gives

[abc]
[def]
[ghi\]
[klm]
[nop]

The required output is an array of strings:

[abc]
[def]
[ghi\:klm]
[nop]

How can the \: be ignored when splitting?

3
  • Is the following also possible: abc:"def:ghi":jkl? Commented Oct 6, 2010 at 9:35
  • I believe third result should be [ghi:klm]. '\' was meant to escape the :, not to be part of the output. Commented Aug 1, 2018 at 4:19
  • rosettacode.org/wiki/Tokenize_a_string_with_escaping Commented Aug 25, 2018 at 10:34

2 Answers

18

Use a look-behind assertion:

split("(?<!\\\\):")

This matches a colon only if it is not preceded by a \. The four backslashes \\\\ are needed because the backslash must be escaped once for the Java string literal and once for the regular expression.
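As a minimal sketch of the look-behind split applied to the string from the question (the class and method names here are illustrative, not from the original answer):

```java
import java.util.Arrays;

public class LookBehindSplit {
    // Splits on ':' unless the colon is immediately preceded by a backslash.
    // The regex is (?<!\\): and each backslash is doubled again for the Java literal.
    static String[] splitOnUnescapedColon(String input) {
        return input.split("(?<!\\\\):");
    }

    public static void main(String[] args) {
        String input = "abc:def:ghi\\:klm:nop"; // the text abc:def:ghi\:klm:nop
        System.out.println(Arrays.toString(splitOnUnescapedColon(input)));
        // prints [abc, def, ghi\:klm, nop]
    }
}
```

Note that the escape character stays in the token (ghi\:klm), which matches the output requested in the question.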

Note, however, that this does not let you escape the backslash itself, which you need if a token may end with a backslash. To support that, first replace all double backslashes with a placeholder:

string.replaceAll("\\\\\\\\", ESCAPE_BACKSLASH)

(where ESCAPE_BACKSLASH is a string that does not occur in your input). Then, after splitting with the look-behind assertion, replace ESCAPE_BACKSLASH in each token with a single unescaped backslash:

token.replaceAll(ESCAPE_BACKSLASH, "\\\\")
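Put together, the three steps might look like the following sketch. The placeholder value (a Unicode private-use character) and the class and method names are assumptions chosen for illustration. The approach only works if the placeholder really never appears in the input.

```java
import java.util.Arrays;

public class PlaceholderSplit {
    // Assumption: this private-use character never occurs in the input.
    static final String ESCAPE_BACKSLASH = "\uE000";

    static String[] split(String input) {
        // 1. Hide every escaped backslash (two backslashes) behind the placeholder.
        String hidden = input.replaceAll("\\\\\\\\", ESCAPE_BACKSLASH);
        // 2. Split on colons that are not preceded by a backslash.
        String[] tokens = hidden.split("(?<!\\\\):");
        // 3. Turn each placeholder back into a single, unescaped backslash.
        for (int i = 0; i < tokens.length; i++) {
            tokens[i] = tokens[i].replaceAll(ESCAPE_BACKSLASH, "\\\\");
        }
        return tokens;
    }

    public static void main(String[] args) {
        String input = "abc\\\\:def\\:ghi"; // the text abc\\:def\:ghi
        System.out.println(Arrays.toString(split(input)));
        // prints [abc\, def\:ghi]
    }
}
```

Here the first token ends with a real backslash, while the escaped colon in the second token is left untouched, as with the plain look-behind split.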

1

Gumbo was right to use a look-behind assertion, but if your string contains an escaped escape character (e.g. \\) right in front of the delimiter (a comma in this example), the split gives the wrong result. See this example:

test1\,test1,test2\\,test3\\\,test3\\\\,test4

If you do a simple look-behind split with (?<!\\), as Gumbo suggested, the string is split into only two parts: test1\,test1 and test2\\,test3\\\,test3\\\\,test4. This is because the look-behind checks only one character back for the escape character. The correct behaviour is to split on every comma that is preceded by an even number of escape characters (including zero), since only an odd number of backslashes actually escapes the comma.
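The behaviour of the single look-behind on this input can be reproduced with a short sketch (class and method names are illustrative):

```java
public class SingleLookBehindDemo {
    // Splits on ',' unless the comma is immediately preceded by a backslash.
    // Only one character is inspected, so an escaped backslash before a comma
    // is mistaken for an escape of the comma itself.
    static String[] naiveSplit(String input) {
        return input.split("(?<!\\\\),");
    }

    public static void main(String[] args) {
        // The text test1\,test1,test2\\,test3\\\,test3\\\\,test4
        String input = "test1\\,test1,test2\\\\,test3\\\\\\,test3\\\\\\\\,test4";
        for (String part : naiveSplit(input)) {
            System.out.println(part);
        }
        // prints two lines:
        // test1\,test1
        // test2\\,test3\\\,test3\\\\,test4
    }
}
```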

To achieve this, a slightly more complex (double) look-behind expression is needed:

(?<!(?<![^\\]\\(?:\\{2}){0,10})\\),

Using this regular expression in a Java string literal again requires escaping every \ as \\, so a more robust answer to the question is:

"any comma separated string".split("(?<!(?<![^\\\\]\\\\(?:\\\\{2}){0,10})\\\\),");

Note: Java does not support unbounded repetition inside look-behinds. Therefore the expression {0,10} checks for at most 10 repeated pairs of escape characters. If needed, you can raise this limit by increasing the second number.
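If the fixed repetition limit is a concern, one alternative is to skip the regular expression and scan the string character by character. This is a different technique from the look-behind above, shown here only as a sketch. The class name, method name, and parameters are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class EscapedSplit {
    // Splits input on delimiter, treating escape + any character as a literal pair.
    // Escape characters are kept in the tokens, like the regex-based split does.
    static List<String> splitEscaped(String input, char delimiter, char escape) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean escaped = false; // true if the previous character was an active escape
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (escaped) {
                current.append(c);   // escaped character: always literal
                escaped = false;
            } else if (c == escape) {
                current.append(c);   // keep the escape character in the token
                escaped = true;
            } else if (c == delimiter) {
                tokens.add(current.toString()); // unescaped delimiter ends the token
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        tokens.add(current.toString()); // last token (may be empty)
        return tokens;
    }

    public static void main(String[] args) {
        // The text test1\,test1,test2\\,test3\\\,test3\\\\,test4
        String input = "test1\\,test1,test2\\\\,test3\\\\\\,test3\\\\\\\\,test4";
        for (String token : splitEscaped(input, ',', '\\')) {
            System.out.println(token);
        }
        // prints four lines:
        // test1\,test1
        // test2\\
        // test3\\\,test3\\\\
        // test4
    }
}
```

Unlike String.split, this sketch keeps trailing empty tokens, so an input that ends with an unescaped delimiter yields a final empty string.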
