8

I have to split a string using a comma (,) as the separator, ignoring any comma that appears inside double quotes (").

fieldSeparator : ,
fieldGrouper : "

The string to split is : "1","2",3,"4,5"

I can achieve this as follows:

String record = "\"1\",\"2\",3,\"4,5\"";
String[] tokens = record.split(",(?=([^\"]*\"[^\"]*\")*[^\"]*$)");

Output :

"1"
"2"
3
"4,5"

Now the challenge is that the fieldGrouper (") should not be part of the split tokens. I have been unable to figure out a regex for this.

The expected output of the split is :

1
2
3
4,5
2
  • I think that doing this char by char will actually be more readable and definitely faster. The algorithm is as simple as it gets, and it makes it easier to handle the "" special case, which will likely appear sooner or later. Commented Mar 7, 2016 at 12:00
  • May we ask why you are working with malformed pseudo-JSON input? The funkiness with the quotes makes this hard to deal with, and it might be better for you to clean up the source. Commented Mar 7, 2016 at 12:06
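
The char-by-char approach suggested in the first comment could look roughly like the sketch below. The class and method names are made up for illustration, and treating a doubled quote ("") inside a quoted field as one literal quote is an assumption borrowed from common CSV conventions, not something stated in the question.

```java
import java.util.ArrayList;
import java.util.List;

class QuotedFieldSplitter {

    // Splits a record on the separator, ignoring separators between groupers.
    // Assumption: a doubled grouper inside a quoted field stands for one
    // literal grouper character.
    static List<String> split(String record, char separator, char grouper) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;

        for (int i = 0; i < record.length(); i++) {
            char c = record.charAt(i);
            if (c == grouper) {
                if (inQuotes && i + 1 < record.length() && record.charAt(i + 1) == grouper) {
                    current.append(grouper); // doubled grouper inside a quoted field
                    i++;                     // skip the second grouper
                } else {
                    inQuotes = !inQuotes;    // opening or closing grouper: drop it
                }
            } else if (c == separator && !inQuotes) {
                tokens.add(current.toString()); // end of the current field
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        tokens.add(current.toString()); // last field
        return tokens;
    }

    public static void main(String[] args) {
        // prints [1, 2, 3, 4,5]
        System.out.println(split("\"1\",\"2\",3,\"4,5\"", ',', '"'));
    }
}
```

Because the groupers are dropped while scanning, no second pass is needed to strip them from the tokens.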

4 Answers

4

Update:

String[] tokens = record.split( "(,*\",*\"*)" );


Initial Solution:
(does not work with the .split method)

This regex pattern will isolate the sections you want:
(?:\\")(.*?)(?:\\")

It uses non-capturing groups to isolate the pairs of escaped quotes, and a capturing group to isolate everything in between.

Check it out here: Live Demo


4 Comments

This regex does not match 3 or any other values not enclosed with "...".
@WiktorStribiżew I updated the solution, but in my initial solution I assumed that the "#" pattern was consistent. I didn't realize that 3 was not captured, and still wonder if @rvd purposely has a different format for 3. Either way, the new solution works.
Sorry, but your second solution will not work for input like 1,2 where 1 and 2 are separate numbers.
@MikhailovValentine The output matches @rvd's requirements. See: Original Post / The expected output of the split is :
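
The behaviour discussed in these comments can be checked with a small sketch. The class and method names are hypothetical; only the split pattern comes from the updated answer. On the question's input the pattern matches the opening quote at index 0, so Java's split returns an empty leading token, and on an unquoted input such as 1,2 nothing matches, so the record is not split at all.

```java
import java.util.Arrays;

class SplitPatternCheck {

    // Applies the split pattern from the updated answer unchanged.
    static String[] splitOnQuotes(String record) {
        return record.split("(,*\",*\"*)");
    }

    public static void main(String[] args) {
        // prints [, 1, 2, 3, 4,5] (note the empty first token)
        System.out.println(Arrays.toString(splitOnQuotes("\"1\",\"2\",3,\"4,5\"")));
        // prints [1,2] (a single token, because no quote is present)
        System.out.println(Arrays.toString(splitOnQuotes("1,2")));
    }
}
```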
2

My suggestion:

"([^"]+)"|(?<=,|^)([^,]*)

See the regex demo. The first alternative matches "..."-style strings and captures only the text between the quotes into Group 1. The second alternative matches a sequence of characters other than , at the start of the string or right after a comma, and captures it into Group 2.

Here is sample Java code:

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

String s = "value1,\"1\",\"2\",3,\"4,5\",value2";
Pattern pattern = Pattern.compile("\"([^\"]+)\"|(?<=,|^)([^,]*)");
Matcher matcher = pattern.matcher(s);
List<String> res = new ArrayList<>();
while (matcher.find()) {                     // Run the matcher
    if (matcher.group(1) != null) {          // If Group 1 matched (quoted field)
        res.add(matcher.group(1));           // Add it to the resulting list
    } else {
        res.add(matcher.group(2));           // Otherwise Group 2 matched (unquoted field)
    }
}
System.out.println(res); // => [value1, 1, 2, 3, 4,5, value2]

1 Comment

The better suggestion is that he clean up his source data IMHO.
1

I would try this kind of workaround:

String record = "\"1\",\"2\",3,\"4,5\"";
record = record.replaceAll("\"?(?<!\"\\w{1,9999}),\"?|\""," ");
String[] tokens = record.trim().split(" ");
for(String str : tokens){
    System.out.println(str);
}

Output:

1
2
3
4,5

1 Comment

I ultimately had to use a similar workaround, i.e., first split and then remove the quotes (if present) from each token.
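
The split-then-strip workaround described in this comment might look like the following sketch. It reuses the split regex from the question; the class and method names are hypothetical, and it assumes a quoted token both starts and ends with the grouper.

```java
import java.util.Arrays;

class SplitThenStrip {

    // Splits on commas outside quotes (regex from the question), then removes
    // one pair of surrounding quotes from each token that has them.
    static String[] splitAndUnquote(String record) {
        String[] tokens = record.split(",(?=([^\"]*\"[^\"]*\")*[^\"]*$)");
        for (int i = 0; i < tokens.length; i++) {
            String t = tokens[i];
            if (t.length() >= 2 && t.startsWith("\"") && t.endsWith("\"")) {
                tokens[i] = t.substring(1, t.length() - 1);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [1, 2, 3, 4,5]
        System.out.println(Arrays.toString(splitAndUnquote("\"1\",\"2\",3,\"4,5\"")));
    }
}
```

Doing the stripping per token keeps the split regex unchanged and leaves unquoted values such as 3 untouched.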
0

My suggestion:

record = record.replaceAll("\",", "|");
record = record.replaceAll(",\\\"", "|");
record = record.replaceAll("\"", "");

String[] tokens = record.split("\\|");

for (String token : tokens) {
   System.out.println(token);
}
