Python Regex behaviour with Square Brackets []

Question

This is the text file abc.txt

abc.txt

aa:s0:education.gov.in
bb:s1:defence.gov.in
cc:s2:finance.gov.in

I'm trying to parse this file by tokenizing (correct me if this is the incorrect term :) ) at every ":" using the following regular expression.

parser.py

import re,sys,os,subprocess
# Raw string so that "\a" is not read as an escape sequence
path = r"C:\abc.txt"
site_list = open(path,'r')
for line in site_list:
    site_line = re.search(r'(\w)*:(\w)*:([\w\W]*\.[\W\w]*\.[\W\w]*)',line)
    print('Regex found that site_line.group(2) = '+str(site_line.group(2)))

Why is the output

Regex found that site_line.group(2) = 0
Regex found that site_line.group(2) = 1
Regex found that site_line.group(2) = 2

Can someone please help me understand why it matches only the last character of the second group? I think it's matching 0 from s0, 1 from s1 and 2 from s2.

But Why ?

Regex is overkill for what you're trying to accomplish. Just split the line on the colon, and you will get the elements as a list (line.split(':')).
– Darrick Herwehe, Commented Feb 23, 2015 at 17:54
"Overkill"? Does that mean it's a pretty complicated way of achieving something simple? :) Or will it be slower than line.split(':')? Thanks, I'll use line.split, but I'm also learning regex, which is why I asked the question :)
– Dhiwakar Ravikumar, Commented Feb 23, 2015 at 17:58

user2357112 · Accepted Answer · 2015-02-23 18:00:26Z

3

Let's show a simplified example:

>>> re.search(r'(.)*', 'asdf').group(1)
'f'
>>> re.search(r'(.*)', 'asdf').group(1)
'asdf'

If you have a repetition operator around a capturing group, the group stores the last repetition. Putting the group around the repetition operator does what you want.
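Applied to one of the lines from the question, the difference looks roughly like this. This is only a sketch: the third group is simplified to (.*) for brevity, which is my assumption and not the asker's original pattern.

```python
import re

line = "aa:s0:education.gov.in"

# The group wraps a single character and the group itself is repeated,
# so only the last repetition is kept.
repeated_group = re.search(r'(\w)*:(\w)*:(.*)', line)
print(repeated_group.group(2))  # 0

# The repetition sits inside the group, so the whole run is captured.
grouped_repetition = re.search(r'(\w*):(\w*):(.*)', line)
print(grouped_repetition.group(2))  # s0
```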

If you were expecting to see data from the third group, that would be group(3). group(0) is the whole match, and group(1), group(2), etc. count through the actual parenthesized capturing groups.
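A quick way to see the numbering is to print each group for one sample line. The pattern here is a simplified one with the repetition inside each group, chosen for illustration rather than taken from the question.

```python
import re

match = re.search(r'(\w*):(\w*):(.*)', 'cc:s2:finance.gov.in')

print(match.group(0))  # the whole match: cc:s2:finance.gov.in
print(match.group(1))  # first parenthesized group: cc
print(match.group(3))  # third parenthesized group: finance.gov.in
print(match.groups())  # all numbered groups, without group 0
```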

That said, as the comments suggest, regexes are overkill for this.

>>> 'aa:s0:education.gov.in'.split(':')
['aa', 's0', 'education.gov.in']
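For the whole file, the split approach could look something like the sketch below. The lines are inlined here instead of being read from abc.txt, and the names prefix, tag and site are made up for illustration. Passing 2 as maxsplit is an extra precaution I'm assuming is wanted, so that a colon inside the last field would not produce a fourth element.

```python
lines = [
    "aa:s0:education.gov.in\n",
    "bb:s1:defence.gov.in\n",
    "cc:s2:finance.gov.in\n",
]

parsed = []
for line in lines:
    # strip() drops the trailing newline; split at most twice on ':'
    prefix, tag, site = line.strip().split(':', 2)
    parsed.append((prefix, tag, site))
    print(tag, site)
```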

answered Feb 23, 2015 at 18:00

user2357112


Jimilian · Accepted Answer · 2015-02-23 17:57:06Z

2

Also, group 0 is the entire match by default.

If a groupN argument is zero, the corresponding return value is the entire matching string.

So you should skip it and check group(3) if you want the last one.

Also, you should compile the regexp before the for loop. It can improve the performance of your parser.
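A minimal sketch of compiling once outside the loop, using inlined sample lines and a simplified pattern (both are assumptions for illustration). Note that the re module keeps a cache of recently compiled patterns, so the speed-up for a single pattern may be small; the main gain is often readability.

```python
import re

# Compiled once, reused for every line.
SITE_PATTERN = re.compile(r'(\w*):(\w*):(.*)')

lines = [
    "aa:s0:education.gov.in",
    "bb:s1:defence.gov.in",
    "cc:s2:finance.gov.in",
]

second_fields = []
for line in lines:
    match = SITE_PATTERN.search(line)
    if match:  # search() returns None when nothing matches
        second_fields.append(match.group(2))

print(second_fields)  # ['s0', 's1', 's2']
```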

You can also replace (\w)* with (\w*) if you want to match all the characters between the colons.

edited Feb 23, 2015 at 17:57

answered Feb 23, 2015 at 17:49

Jimilian


2 Comments

Darrick Herwehe Over a year ago

While there may be benefits to pre-compiling, performance improvement is questionable.

Jimilian Over a year ago

@interjay, this answer was based on my conclusions. 1) The OP asks what's wrong with the brackets []. Only the last group has brackets, so I decided that the OP wants to print the last group. 2) The OP is not using group(0), but I decided that the OP wants to print the last group and is using group(2) for this purpose. That is wrong, because group(0) is a "bonus".
