3

This is the text file abc.txt:

abc.txt

aa:s0:education.gov.in
bb:s1:defence.gov.in
cc:s2:finance.gov.in

I'm trying to parse this file by tokenizing (correct me if this is the incorrect term :) ) at every ":" using the following regular expression.

parser.py

import re

path = r"C:\abc.txt"  # raw string, so "\a" is not read as an escape sequence
with open(path, 'r') as site_list:
    for line in site_list:
        site_line = re.search(r'(\w)*:(\w)*:([\w\W]*\.[\W\w]*\.[\W\w]*)', line)
        print('Regex found that site_line.group(2) = ' + str(site_line.group(2)))

Why is the output

Regex found that site_line.group(2) = 0
Regex found that site_line.group(2) = 1
Regex found that site_line.group(2) = 2

Can someone please help me understand why group(2) holds only the last character of the second field? I think it's matching 0 from s0, 1 from s1 and 2 from s2.

But why?

3
  • Why are you using re.search instead of re.match? Commented Feb 23, 2015 at 17:47
  • 2
    regex is overkill for what you're trying to accomplish. Just split the line on the colon, and you will get the elements as a list (line.split(':')) Commented Feb 23, 2015 at 17:54
  • "Overkill"? Does that mean it's a pretty complicated way of achieving something simple? :) Or will it be slower than line.split(':')? Thanks, I'll use line.split, but I'm also learning regex, which is why I asked the question :) Commented Feb 23, 2015 at 17:58

2 Answers

3

Let's show a simplified example:

>>> re.search(r'(.)*', 'asdf').group(1)
'f'
>>> re.search(r'(.*)', 'asdf').group(1)
'asdf'

If a repetition operator is applied to a capturing group, as in (.)*, the group stores only its last repetition. Putting the repetition operator inside the group, as in (.*), captures the whole repeated run, which is what you want.

If you were expecting to see data from the third group, that would be group(3). group(0) is the whole match, and group(1), group(2), etc. count through the actual parenthesized capturing groups.
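Applied to one line of the question's file, the difference looks roughly like this. It is a minimal sketch: the third group is simplified to (.*) for readability, and the variable names outside and inside are only illustrative.

```python
import re

line = 'aa:s0:education.gov.in'

# Repetition applied to the group: each group keeps only its last repetition.
outside = re.search(r'(\w)*:(\w)*:(.*)', line)

# Repetition inside the group: each group keeps the whole run of characters.
inside = re.search(r'(\w*):(\w*):(.*)', line)

print(outside.group(2))   # last character of the second field
print(inside.group(0))    # the entire match
print(inside.group(1), inside.group(2), inside.group(3))
```

Here outside.group(2) is '0', while inside.group(2) is 's0' and inside.group(3) is 'education.gov.in'.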

That said, as the comments suggest, regexes are overkill for this.

>>> 'aa:s0:education.gov.in'.split(':')
['aa', 's0', 'education.gov.in']
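For the whole file, the same idea could be sketched as below. The list lines stands in for the contents of abc.txt, and parse_line is a hypothetical helper name, not something from the question.

```python
# Stand-in for the lines read from abc.txt (assumed contents from the question).
lines = [
    'aa:s0:education.gov.in\n',
    'bb:s1:defence.gov.in\n',
    'cc:s2:finance.gov.in\n',
]

def parse_line(line):
    """Split one line into its three colon-separated fields."""
    # maxsplit=2 keeps any further colons inside the last field.
    name, code, site = line.strip().split(':', 2)
    return name, code, site

# Skip blank lines, then parse the rest.
records = [parse_line(line) for line in lines if line.strip()]

for name, code, site in records:
    print(name, code, site)
```

Limiting the split to two separators is a design choice for this sketch: it guarantees exactly three fields even if the site part ever contains a colon.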

2

Also, the first group, group(0), is the entire match by default. As the re documentation puts it:

If a groupN argument is zero, the corresponding return value is the entire matching string.

So you should skip it, and check group(3) if you want the last group.

Also, you should compile the regexp before the for loop. This may improve the performance of your parser.

And you can replace (\w)* with (\w*) if you want to capture all the word characters between the colons.
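Both suggestions can be combined in a short sketch. The name SITE_RE and the sample lines are assumptions for illustration, and the third group is simplified to (.*). Since the re module caches recently used patterns, any speed-up from compiling ahead of time is likely to be small.

```python
import re

# Compiled once, before the loop; the repetition sits inside each group.
SITE_RE = re.compile(r'(\w*):(\w*):(.*)')

# Stand-in for the lines read from abc.txt (assumed contents from the question).
lines = [
    'aa:s0:education.gov.in\n',
    'bb:s1:defence.gov.in\n',
    'cc:s2:finance.gov.in\n',
]

second_fields = []
for line in lines:
    match = SITE_RE.match(line)
    if match is None:
        continue  # skip lines that do not have the expected shape
    second_fields.append(match.group(2))
    print('group(2) =', match.group(2), '| group(3) =', match.group(3))
```

Because . does not match a newline by default, group(3) stops before the trailing line break.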

2 Comments

While there may be benefits to pre-compiling, performance improvement is questionable.
@interjay, this answer was based on my own conclusions. 1) The OP asks what's wrong with the brackets []. Only the last group has brackets, so I decided that the OP wants to print the last group. 2) The OP is not using group(0), but I decided that the OP wants to print the last group and is using group(2) for this purpose, which is wrong, because group(0) is a "bonus".
