0

I'm trying to use a regex to split a string into fields, but it isn't working completely: it skips some of the places where the string should be split. Here is the part of the program that processes the string:

void parser(String s) {
    String REG1 = "(',\\d)|(',')|(\\d,')|(\\d,\\d)";
    Pattern p1 = Pattern.compile(REG1);
    Matcher m1 = p1.matcher(s);
    while (m1.find()) {
        System.out.println(counter + ":  " + s.substring(end, m1.end() - 1) + " " + end + "  " + m1.end());
        end = m1.end();
        counter++;
    }
}

The string is:

s= 3101,'12HQ18U0109','11YX27X0041','XX21','SHV7-P Hig, Hig','','GW1','MON','E','A','ASEXPORT-1',1,101,0,'0','1500','V','','',0,'mb-master1'

The problem is that it doesn't split at ,1, or ,0,

The rules for parsing are: a string is enclosed in single quotes between commas, for example ,'ASEXPORT-1', while an int is delimited only by the commas, for example ,101,

Expected output:

3101 | 12HQ18U0109 | 11YX27X0041 | XX21 | SHV7-P Hig, Hig |  | GW1 | MON | E | A | ASEXPORT-1 | 1 | 101 | 0 | 0 | 1500 | V |  |  | 0 | mb-master1

Altogether 21 elements.

  • Why don't you String.split(',') first and then check whether each piece is enclosed in "'" or not? Commented May 23, 2013 at 7:29
  • Could a string include a comma? (E.g. 'str,ing') Commented May 23, 2013 at 7:29
  • I think you could just split the whole string on "," and then the elements enclosed in single quotes would be strings and the elements without single quotes would be ints. Commented May 23, 2013 at 7:30
  • You should specify your expected output. Commented May 23, 2013 at 7:33
  • Could a string contain an escaped '? (E.g. 'SHV7 \'02') Commented May 23, 2013 at 7:41

2 Answers

4

You can split it with this regex:

,(?=([^']*'[^']*')*[^']*$)

It splits at a , only if an even number of ' characters follows it, that is, only when the comma lies outside a quoted string.


So for

3101,'12HQ18,U0109','11YX27X0041'

output would be

3101
'12HQ18,U0109'
'11YX27X0041'
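
As a minimal sketch of how this could be wired up in Java (the class name QuoteAwareSplit and the helper splitFields are made up for illustration), the pattern can be passed to Pattern.split. A limit of -1 is used so that trailing empty fields are not dropped; the fields keep their surrounding quotes.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original answer.
class QuoteAwareSplit {
    // A comma is a separator only if an even number of single quotes
    // follows it up to the end of the line.
    private static final Pattern SEPARATOR =
            Pattern.compile(",(?=([^']*'[^']*')*[^']*$)");

    static String[] splitFields(String line) {
        // limit -1 keeps trailing empty fields instead of discarding them
        return SEPARATOR.split(line, -1);
    }

    public static void main(String[] args) {
        String s = "3101,'12HQ18U0109','11YX27X0041','XX21','SHV7-P Hig, Hig','',"
                + "'GW1','MON','E','A','ASEXPORT-1',1,101,0,'0','1500','V','','',0,"
                + "'mb-master1'";
        String[] fields = splitFields(s);
        System.out.println(fields.length + " fields");
        for (String field : fields) {
            System.out.println(field);
        }
    }
}
```

For the string from the question this should yield the 21 fields, with 'SHV7-P Hig, Hig' kept in one piece.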

Note

It won't work for nested quotes such as 'hello 'h,i'world'. If such cases can occur, you should use the following regex instead:

(?<='),(?=')|(?<=\d),(?=\d|')|(?<=\d|'),(?=\d)
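
The pattern in the question skips separators because each alternative consumes the characters around the comma, so a digit or quote that belongs to two neighbouring separators can only be matched once. Lookarounds check those characters without consuming them. A rough sketch of using the lookaround regex follows; the class name LookaroundSplit and the method parseFields are invented for this example, and stripping the quotes afterwards is an extra step added to match the expected output.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original answer.
class LookaroundSplit {
    // Only the comma itself is consumed; the neighbouring quote or digit
    // is merely inspected by the lookbehind and lookahead.
    private static final Pattern SEPARATOR = Pattern.compile(
            "(?<='),(?=')|(?<=\\d),(?=\\d|')|(?<=\\d|'),(?=\\d)");

    static List<String> parseFields(String line) {
        List<String> fields = new ArrayList<>();
        for (String raw : SEPARATOR.split(line, -1)) {
            // Drop the enclosing single quotes of string fields, if present.
            boolean quoted = raw.length() >= 2
                    && raw.startsWith("'") && raw.endsWith("'");
            fields.add(quoted ? raw.substring(1, raw.length() - 1) : raw);
        }
        return fields;
    }

    public static void main(String[] args) {
        String s = "3101,'12HQ18U0109','11YX27X0041','XX21','SHV7-P Hig, Hig','',"
                + "'GW1','MON','E','A','ASEXPORT-1',1,101,0,'0','1500','V','','',0,"
                + "'mb-master1'";
        System.out.println(String.join(" | ", parseFields(s)));
    }
}
```

One caveat with this variant: a comma that sits directly between two digits inside a quoted string (for example '1,2') would also be treated as a separator.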

0

If you also (for some bizarre reason) need to know each match's start and end index in the original string (as in your sample output), you can use the following pattern:

String regex = "('[^']*'|\\d+)";

which matches either a single-quoted string or an unquoted integer.
You can optionally remove the leading and trailing ' with a "second pass" over the matched substring:

match = match.replaceAll("\\A'|'\\Z", "");

which replaces a leading ' and a trailing ' with nothing.

The code could look like this:

Pattern pat = Pattern.compile("('[^']*'|\\d+)");
Matcher m = pat.matcher(str);

int counter = 0;
while (m.find()) {
    String match = m.group(1);
    int start = m.start();   // index of the match in the original string
    int end = m.end();       // index just past the end of the match
    match = match.replaceAll("\\A'|'\\Z", "");   // <-- comment out to keep the
                                                 //     leading and trailing quotes
    System.out.format("%d: %s [%d - %d]%n", ++counter, match, start, end);
}

See, also, this short demo.
