Python Regex Non-Capturing Group

Question

I am trying to understand how re.split() works with a non-capturing group when splitting a comma-delimited string.

This is my code:

import re

# Split on every comma for which the lookahead succeeds.
pattern = re.compile(r',(?=(?:"[^"]*")*[^"]*$)')
text = 'qarcac,"this is, test1",123566'
results = re.split(pattern, text)
for r in results:
    print(r.strip())

When I execute this code, the results are as expected.

split1: qarcac

split2: "this is, test1"

split3: 123566

However, if I add one more double-quoted string to the source text, it no longer works as expected:

text = 'qarcac,"this is, test1","this is, test2", 123566, testdata'

It produces the output below:

split1: qarcac,"this is, test1"

split2: "this is, test2"

split3: 123566

Can someone explain what's going on here and how the non-capturing group works differently in these two cases?

You should use the csv module to parse a CSV string. The regex you are using is very inefficient, and if the string is very long, performance might drop significantly.
– Wiktor Stribiżew, Commented Aug 5, 2018 at 11:38
Thanks Wiktor. I am not going to productionize it; I am just trying to learn, as I came across this code in one of my learning modules.
– AngiSen, Commented Aug 5, 2018 at 11:42
The pattern that works is ,(?=(?:"[^"]*"|[^"])*$). Or ,(?=[^"]*(?:"[^"]*"[^"]*)*$). See Regex to pick commas outside of quotes.
– Wiktor Stribiżew, Commented Aug 5, 2018 at 11:45
See regex101.com/r/dRqJZT/1; it gives a good explanation of any regex you type into the pattern field, shown on the right.
– Wiktor Stribiżew, Commented Aug 5, 2018 at 11:58
Thanks Wiktor. How does re.split() mark the first occurrence of a comma in the source string when [^"]* is used, as in the following regex? ,(?=(?:"[^"]*"|[^"])*$)
– AngiSen, Commented Aug 5, 2018 at 12:10
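
For reference, the csv suggestion from the comments could look roughly like the sketch below. The helper name parse_csv_line is made up for illustration, and skipinitialspace=True is an assumption about the desired result (it drops the space that follows a delimiter). Note that, unlike the regex split, csv removes the surrounding quotes from quoted fields.

```python
import csv

def parse_csv_line(line):
    """Parse one CSV-formatted line into a list of fields (illustrative helper)."""
    # csv.reader accepts any iterable of lines, so wrap the single line in a list.
    return next(csv.reader([line], skipinitialspace=True))

text = 'qarcac,"this is, test1","this is, test2", 123566, testdata'
print(parse_csv_line(text))
# ['qarcac', 'this is, test1', 'this is, test2', '123566', 'testdata']
```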

melpomene · Accepted Answer · 2018-08-05 11:45:41Z

This has nothing to do with (non-)capturing groups.

(?:"[^"]*")*[^"]*$ matches:

"[^"]*" - a quoted string (two quotes with 0 or more non-quotes in between)
(?: ... )* - 0 or more of those quoted strings
[^"]* - followed by 0 or more non-quotes
$ - followed by the end of the string

In other words, this regex matches something like "foo""bar""baz"otherstuff.
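
A quick way to check this reading is to run the body of the lookahead (without the leading comma) against a couple of sample strings. This is only a sketch; the names lookahead_body and accepts are invented for the example.

```python
import re

# Body of the lookahead from the question, without the leading comma.
lookahead_body = re.compile(r'(?:"[^"]*")*[^"]*$')

def accepts(s):
    """True if the lookahead body matches starting at the beginning of s."""
    return lookahead_body.match(s) is not None

# Quoted strings right next to each other, then an unquoted tail: accepted.
print(accepts('"foo""bar""baz"otherstuff'))  # True

# A comma between two quoted strings breaks the run of quoted strings: rejected.
print(accepts('"foo","bar"'))  # False
```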

In your first example, the target string is:

qarcac,"this is, test1",123566
       ^^^^^^^^^^^^^^^^^^^^^^^

I've underlined the part that is matched by the above regex (a quoted part followed by an unquoted tail followed by the end of the string).

In your second example, the target string is:

qarcac,"this is, test1","this is, test2", 123566, testdata
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Again, I've underlined the part that is matched by the regex.

The first quoted part is not matched because of the comma:

"this is, test1","this is, test2"
                X

"foo","bar" is not matched because your regex requires the quoted parts to be right next to each other, as in "foo""bar", with nothing in between.
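
To see which commas re.split() actually splits on, one can test the lookahead right after each comma in the second string. The sketch below does that; comma_report is a hypothetical helper written for this illustration, not part of the re module.

```python
import re

# Only the lookahead part of the question's pattern.
lookahead = re.compile(r'(?=(?:"[^"]*")*[^"]*$)')
text = 'qarcac,"this is, test1","this is, test2", 123566, testdata'

def comma_report(s):
    """Return (index, accepted) for every comma in s."""
    report = []
    for i, ch in enumerate(s):
        if ch == ',':
            # Apply the lookahead at the position just after the comma.
            report.append((i, lookahead.match(s, i + 1) is not None))
    return report

for index, accepted in comma_report(text):
    print(index, 'split' if accepted else 'no split')
```

Only the commas at indexes 23, 40 and 48 are accepted, so re.split() cuts the string into four pieces, the first of which is qarcac,"this is, test1".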

If you just want to make sure that every matched comma is outside of a quoted part (i.e. is followed by an even number of quotes), you can simply use

,(?=[^"]*(?:"[^"]*"[^"]*)*$)

as your regex.
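
As a rough check, here is that pattern applied to both strings from the question. The function name split_outside_quotes is made up for the example, and the sketch assumes the input has balanced quotes and no escaped quotes inside quoted parts.

```python
import re

# A comma followed by an even number of quotes up to the end of the string.
fixed = re.compile(r',(?=[^"]*(?:"[^"]*"[^"]*)*$)')

def split_outside_quotes(s):
    """Split s on commas outside quoted parts and strip surrounding whitespace."""
    return [part.strip() for part in fixed.split(s)]

print(split_outside_quotes('qarcac,"this is, test1",123566'))
# ['qarcac', '"this is, test1"', '123566']

print(split_outside_quotes('qarcac,"this is, test1","this is, test2", 123566, testdata'))
# ['qarcac', '"this is, test1"', '"this is, test2"', '123566', 'testdata']
```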


Thank you. Can you please clarify why [^"]* is used in the non-capturing group?
– AngiSen, Commented over a year ago
