python regex gives empty string

Question

First off, I am new to regex, but so far I am in love with it. I am using a regex to extract info from an image file's name that I get from a render engine. So far this regex is working decently:

_([a-z]{2,8})_?(\d{1,2})?(\.|_)(\d{3,10})\.([a-z]{2,6})$

If I use the split() method on a file name such as...

image_file_name_ao.0001.exr

I get back a nice little list I can use:

['image_file_name', 'ao', None, '.', '0001', 'exr', '']

My only concern is that it always returns an empty string last. No matter how I change or manipulate the regex it always gives me an empty string at the end of the list. I am totally comfortable with ignoring it and moving on, but my question is am I doing something wrong with my regex or is there something I can do to make it not pass that final empty string? Thank you for your time.

See my answer to your question, please. But why did you want to use re.split instead of capturing groups, as in Katzwinkel's answer? By the way, why don't you capture in groups the potential underscore before (\d{1,2})? and the last dot?
– eyquem, Commented Feb 28, 2013 at 22:08

J. Katzwinkel · Accepted Answer · 2013-02-28 22:29:51Z

3

No wonder. The split method splits your string at occurrences of the regex (and also returns the text captured by each group). And since your regex matches only substrings which reach the end of the line (indicated by the $ at its end), there is nothing to split off at the file name's end but an empty suffix ('').
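A minimal sketch of this behavior, assuming Python 3 and the pattern and file name from the question (the variable names are only illustrative). It shows that re.split returns the text before the match, then each group, then the text after the match, and that the trailing $ forces that last piece to be empty:

```python
import re

# Pattern from the question; the trailing $ anchors the match to the end of the string.
pattern = r'_([a-z]{2,8})_?(\d{1,2})?(\.|_)(\d{3,10})\.([a-z]{2,6})$'
filename = 'image_file_name_ao.0001.exr'

parts = re.split(pattern, filename)
print(parts)
# ['image_file_name', 'ao', None, '.', '0001', 'exr', '']

# The same pieces rebuilt from a single search: prefix, groups, suffix.
m = re.search(pattern, filename)
prefix = filename[:m.start()]
suffix = filename[m.end():]  # empty, because the match runs to the end of the string
print([prefix, *m.groups(), suffix] == parts)
# True
```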

Given that you are already using groups "(...)" in your expression, you could as well use re.match(regex, string). This will give you a MatchObject instance, from which you can retrieve a tuple containing your groups via groups():

import re

filename = 'image_file_name_ao.0001.exr'
# additional group up front
reg = r'(\S*)_([a-z]{2,8})_?(\d{1,2})?(\.|_)(\d{3,10})\.([a-z]{2,6})$'
print(re.match(reg, filename).groups())  # request tuple of group matches

Edit: I'm really sorry but I didn't realize that your pattern does not match the file name string from its first character on. I extended it in my answer. If you want to stick with your approach using split(), you might also change your original pattern in a way that the last part of the file name is not matched and hence split off.
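To illustrate why the extra leading group matters: re.match only matches at the start of the string, so the question's pattern, which begins with an underscore, returns None for this file name. The sketch below assumes Python 3; the names original and extended are illustrative, and the re.search call is an alternative that is not part of the answer above.

```python
import re

filename = 'image_file_name_ao.0001.exr'
original = r'_([a-z]{2,8})_?(\d{1,2})?(\.|_)(\d{3,10})\.([a-z]{2,6})$'
extended = r'(\S*)' + original  # extra leading group, as in the answer

# re.match anchors at position 0, and the file name does not start with '_'.
print(re.match(original, filename))
# None

# With the leading (\S*) group the name is matched from its first character.
print(re.match(extended, filename).groups())
# ('image_file_name', 'ao', None, '.', '0001', 'exr')

# Alternative (not from the answer): re.search scans for a match anywhere in the string.
print(re.search(original, filename).groups())
# ('ao', None, '.', '0001', 'exr')
```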

edited Feb 28, 2013 at 22:29

answered Feb 28, 2013 at 20:16

J. Katzwinkel


7 Comments

Matt Pearson Over a year ago

Thanks for your quick answer. I have tried the match() method, but I only ever receive back a NoneType. I have tried it both by compiling the regex and in the way you explained above, to no effect. The test string I was using above worked fine with the split() method but never seemed to work with the match() method.

J. Katzwinkel Over a year ago

You probably didn't notice the revision I made on my answer. It works like it is shown right now. Try re.match('(\S*)_([a-z]{2,8})_?(\d{1,2})?(\.|_)(\d{3,10})\.([a-z]{2,6})$', 'image_file_name_ao.0001.exr').groups() and I promise you it will do the trick. I am very sorry for the confusion!

Matt Pearson Over a year ago

I completely missed that addition, my fault. Now it is working perfectly, and thank you very much for the help.

J. Katzwinkel Over a year ago

My pleasure! Great that I could help. If you really are into regex, you might want to consider learning Perl.

eyquem Over a year ago

@J. Katzwinkel @Matt Pearson [g for g in m.groups()] is the list of the groups. If you absolutely want the groups presented in a list, list(re.match(reg, filename).groups()) is more direct. Otherwise, use re.match(reg, filename).groups(), which is the native presentation of the groups, as a tuple. Do you specifically want a list of the groups?


eyquem · Accepted Answer · 2013-02-28 21:48:15Z

1

Interesting question.

I changed the regex pattern a little:

import re

# The extension is only checked by a lookahead, so it is not consumed by the match.
reg = re.compile(r'_([a-z]{2,8})'
                 r'_?(\d\d?)?'
                 r'([._])'
                 r'(\d{3,10})'
                 r'\.'
                 r'(?=[a-z]{2,6}$)')

for ss in ('image_file_name_ao.0001.exr',
           'image_file_name_45_ao.0001.exr',
           'image_file_name_ao_78.0001.exr',
           'image_file_name_ao78.0001.exr'):
    print('%s\n%r\n' % (ss, reg.split(ss)))

result

image_file_name_ao.0001.exr
['image_file_name', 'ao', None, '.', '0001', 'exr']

image_file_name_45_ao.0001.exr
['image_file_name_45', 'ao', None, '.', '0001', 'exr']

image_file_name_ao_78.0001.exr
['image_file_name', 'ao', '78', '.', '0001', 'exr']

image_file_name_ao78.0001.exr
['image_file_name', 'ao', '78', '.', '0001', 'exr']
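A short side-by-side sketch of why this pattern avoids the trailing empty string, assuming Python 3 (the names consuming and lookahead are illustrative). A lookahead (?=...) checks the extension without consuming it, so 'exr' is left over as the text after the match instead of an empty string:

```python
import re

filename = 'image_file_name_ao.0001.exr'

# Extension consumed by a capturing group: nothing is left after the match.
consuming = re.compile(r'_([a-z]{2,8})_?(\d\d?)?([._])(\d{3,10})\.([a-z]{2,6})$')
# Extension only checked by a lookahead: it stays behind as the final piece.
lookahead = re.compile(r'_([a-z]{2,8})_?(\d\d?)?([._])(\d{3,10})\.(?=[a-z]{2,6}$)')

print(consuming.split(filename))
# ['image_file_name', 'ao', None, '.', '0001', 'exr', '']
print(lookahead.split(filename))
# ['image_file_name', 'ao', None, '.', '0001', 'exr']
```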

answered Feb 28, 2013 at 21:48

eyquem


3 Comments

J. Katzwinkel Over a year ago

I've spent the better part of the last hour trying to get an idea of what exactly re.split() is doing when combined with groups, and all I achieved was even more despair. Now you show up and all of a sudden I get it! Thank you for rearranging the expression in a way that emphasizes the individual patterns. And for the reminder on special characters within sets.

eyquem Over a year ago

@Katzwinkel In fact, at first glance, it wasn't clear to me how the OP's code gives its result. So I did the same as you and studied how re.split works when several groups are defined in the pattern. Then I answered the question exactly as asked, which is the Y question of an XY problem (meta.stackexchange.com/questions/66377/what-is-the-xy-problem). You answered the X problem, so I upvoted your answer, which deserves to be accepted.

J. Katzwinkel Over a year ago

Thank you for enlightening me again, @eyquem. I never knew there was dedicated terminology for this, but it makes sense to me. Quite often I find it a challenge to identify the actual question in an OP. On the other hand, new users might feel pressured at first, seeing other people's questions downvoted for not showing enough proof of personal effort. Many might feel they had better come up with at least something, and moments later, ten people are dealing with a Y again. It was fun working on this one and I am happy to have learned so much. Cheers!

Max · Accepted Answer · 2015-08-12 20:03:29Z

1

You can use filter()

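Given your example, this would work like this:

import re

def f(x):
    # keep everything except the empty string
    return x != ''

# list() is needed on Python 3, where filter() returns an iterator
list(filter(
    f,
    re.split(r'_([a-z]{2,8})_?(\d{1,2})?(\.|_)(\d{3,10})\.([a-z]{2,6})$',
             'image_file_name_ao.0001.exr')
))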
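An equivalent sketch with a list comprehension, assuming Python 3 (the names pattern, parts and cleaned are illustrative). Note that only the empty string is dropped; the None produced by the unmatched optional group is kept:

```python
import re

pattern = r'_([a-z]{2,8})_?(\d{1,2})?(\.|_)(\d{3,10})\.([a-z]{2,6})$'
parts = re.split(pattern, 'image_file_name_ao.0001.exr')

# Drop empty strings only; None (the unmatched optional group) is kept.
cleaned = [p for p in parts if p != '']
print(cleaned)
# ['image_file_name', 'ao', None, '.', '0001', 'exr']
```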

edited Aug 12, 2015 at 20:03

answered Aug 10, 2015 at 16:09

Max

