
First off, I am new to regex, but so far I am in love with it. I am using a regex to extract info from an image file's name that I get from a render engine. So far this regex is working decently...

_([a-z]{2,8})_?(\d{1,2})?(\.|_)(\d{3,10})\.([a-z]{2,6})$

If I use the split() method on a file name such as...

image_file_name_ao.0001.exr

I get back a nice little list I can use...

['image_file_name', 'ao', None, '.', '0001', 'exr', '']

My only concern is that it always returns an empty string last. No matter how I change or manipulate the regex it always gives me an empty string at the end of the list. I am totally comfortable with ignoring it and moving on, but my question is am I doing something wrong with my regex or is there something I can do to make it not pass that final empty string? Thank you for your time.

  • See my answer to your question, please. But why did you want to use re.split instead of capturing groups, as in Katzwinkel's answer? By the way, why don't you capture in groups the potential underscore before (\d{1,2})? and the last dot? Commented Feb 28, 2013 at 22:08

3 Answers

3

No wonder. The split method splits your string at occurrences of the regex and, when the pattern contains capturing groups, also returns the text matched by each group. Since your regex only matches substrings that reach the end of the line (indicated by the $ at its end), there is nothing left to split off after the match but an empty suffix ('').
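
A minimal sketch of that behaviour, written for Python 3 and using a made-up pattern and strings rather than the ones from the question: whenever a match ends exactly at the end of the string, split() reports the empty remainder after it as a final item.

```python
import re

# Hypothetical toy input: the pattern matches the trailing ".exr".
parts = re.split(r'\.([a-z]+)$', 'frame.exr')
print(parts)  # ['frame', 'exr', '']

# The same happens without groups or "$" whenever a match touches the end.
comma_parts = re.split(r',', 'a,b,')
print(comma_parts)  # ['a', 'b', '']
```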

Given that you are already using groups "(...)" in your expression, you could just as well use re.match(regex, string). This will give you a MatchObject instance, from which you can retrieve a tuple containing your groups via groups():

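import re

filename = 'image_file_name_ao.0001.exr'
# additional group up front, so the pattern matches from the first character
reg = r'(\S*)_([a-z]{2,8})_?(\d{1,2})?(\.|_)(\d{3,10})\.([a-z]{2,6})$'
print(re.match(reg, filename).groups())  # request tuple of group matches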

Edit: I'm really sorry but I didn't realize that your pattern does not match the file name string from its first character on. I extended it in my answer. If you want to stick with your approach using split(), you might also change your original pattern in a way that the last part of the file name is not matched and hence split off.
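
The extra leading group is needed because re.match() only tries to match at the very start of the string, whereas re.search() scans forward for the first position where the pattern fits. As a sketch (Python 3; the variable names are only illustrative), the original pattern can be kept unchanged with re.search(), and the prefix recovered from the match position:

```python
import re

filename = 'image_file_name_ao.0001.exr'
pattern = r'_([a-z]{2,8})_?(\d{1,2})?(\.|_)(\d{3,10})\.([a-z]{2,6})$'

# re.match() anchors at index 0, where the original pattern cannot match.
assert re.match(pattern, filename) is None

# re.search() finds the match wherever it starts.
m = re.search(pattern, filename)
prefix = filename[:m.start()]  # text before the match
fields = m.groups()
print(prefix)  # image_file_name
print(fields)  # ('ao', None, '.', '0001', 'exr')
```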


7 Comments

Thanks for your quick answer. I have tried the match() method, but I only ever get back None. I have tried it both by compiling the regex and in the way you explained above, to no effect. The test string I was using above worked fine with the split() method but never seemed to work with the match() method.
You probably didn't notice the revision I made on my answer. It works like it is shown right now. Try re.match('(\S*)_([a-z]{2,8})_?(\d{1,2})?(\.|_)(\d{3,10})\.([a-z]{2,6})$', 'image_file_name_ao.0001.exr').groups() and I promise you it will do the trick. I am very sorry for the confusion!
I completely missed that addition, my fault. Now it is working perfectly, and thank you very much for the help.
My pleasure! Great that I could help. If you really are into regex, you might want to consider learning Perl.
@J. Katzwinkel @Matt Pearson [g for g in m.groups()] is the list of the groups. If you absolutely want the groups presented in a list, list(re.match(reg, filename).groups()) is more direct. Otherwise, use re.match(reg, filename).groups() that is the native presentation of the groups, in a tuple. Maybe you want specifically a list of the groups ?
1

Interesting question.

I changed the regex's pattern a little:

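import re

# Adjacent string literals are concatenated into a single pattern
reg = re.compile(r'_([a-z]{2,8})'      # short lowercase tag, e.g. "ao"
                 r'_?(\d\d?)?'         # optional one- or two-digit number
                 r'([._])'             # separator before the long number
                 r'(\d{3,10})'         # 3 to 10 digits, e.g. "0001"
                 r'\.'
                 r'(?=[a-z]{2,6}$)')   # extension: looked ahead, not consumed

for ss in ('image_file_name_ao.0001.exr',
           'image_file_name_45_ao.0001.exr',
           'image_file_name_ao_78.0001.exr',
           'image_file_name_ao78.0001.exr'):
    print('%s\n%r\n' % (ss, reg.split(ss)))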

result

image_file_name_ao.0001.exr
['image_file_name', 'ao', None, '.', '0001', 'exr']

image_file_name_45_ao.0001.exr
['image_file_name_45', 'ao', None, '.', '0001', 'exr']

image_file_name_ao_78.0001.exr
['image_file_name', 'ao', '78', '.', '0001', 'exr']

image_file_name_ao78.0001.exr
['image_file_name', 'ao', '78', '.', '0001', 'exr']
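
What makes the difference is the final (?=[a-z]{2,6}$): a lookahead checks that the extension follows but does not consume it, so the match ends before the extension and split() returns it as ordinary trailing text instead of an empty string. A small sketch with a simplified, made-up pattern and file name (Python 3):

```python
import re

name = 'shot.0001.exr'  # hypothetical file name

consuming = re.search(r'\.(\d+)\.([a-z]+)$', name)
lookahead = re.search(r'\.(\d+)\.(?=[a-z]+$)', name)

# The consuming match runs to the end of the string...
print(consuming.end() == len(name))  # True
# ...while the lookahead match stops just before the extension.
print(name[lookahead.end():])        # exr

split_consuming = re.split(r'\.(\d+)\.([a-z]+)$', name)
split_lookahead = re.split(r'\.(\d+)\.(?=[a-z]+$)', name)
print(split_consuming)  # ['shot', '0001', 'exr', '']
print(split_lookahead)  # ['shot', '0001', 'exr']
```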

3 Comments

I've spent the better part of the last hour trying to get an idea of what exactly re.split() is doing when combined with groups, and all I achieved was even more despair. Now you show up and all of a sudden I get it! Thank you for rearranging the expression in a way that emphasizes the individual patterns. And for the reminder on special characters within sets.
@Katzwinkel In fact, at first glance, it wasn't clear to me how the OP's code gives its result. So I did the same as you: I studied how re.split functions when several groups are defined in the pattern. Then I answered exactly the question asked, which is the Y question of an XY problem (meta.stackexchange.com/questions/66377/what-is-the-xy-problem). You answered the X problem. So I upvoted your answer, which deserves to be accepted.
Thank you for enlightening me again, @eyquem. I never knew there was a dedicated terminology like this, but it makes sense to me. Quite often I feel challenged to identify the actual question in an OP. On the other hand, new users might feel pressured at first, seeing other people's questions downvoted for not showing enough proof of personal effort. Many might feel like they had better come up with at least something, and moments later, ten people are dealing with a Y again. It was fun working on this one and I am happy to have learned so much. Cheers!
1

You can use filter()

Given your example, this would work like this:

import re

def f(x):
    return x != ''

# list() is needed on Python 3, where filter() returns an iterator
parts = list(filter(
    f,
    re.split(r'_([a-z]{2,8})_?(\d{1,2})?(\.|_)(\d{3,10})\.([a-z]{2,6})$',
             'image_file_name_ao.0001.exr')))
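
One detail worth spelling out: comparing against '' keeps the None that marks the unmatched optional group, so the remaining fields stay in their positions. As a sketch (Python 3; the names pattern, parts and cleaned are only illustrative), the same filtering can be written as a list comprehension and contrasted with filter(None, ...), which drops every falsy item:

```python
import re

pattern = r'_([a-z]{2,8})_?(\d{1,2})?(\.|_)(\d{3,10})\.([a-z]{2,6})$'
parts = re.split(pattern, 'image_file_name_ao.0001.exr')

# Drop only empty strings; None (the unmatched optional group) is kept.
cleaned = [p for p in parts if p != '']
print(cleaned)  # ['image_file_name', 'ao', None, '.', '0001', 'exr']

# filter(None, ...) removes every falsy item, including that None.
truthy_only = list(filter(None, parts))
print(truthy_only)  # ['image_file_name', 'ao', '.', '0001', 'exr']
```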
