10

What regex can I use to match ".#,#." within a string? It may or may not exist in the string. Some examples with expected outputs:

Test1.0,0.csv      -> ('Test1', '0,0', 'csv')         (Basic Example)
Test2.wma          -> ('Test2', 'wma')                (No Match)
Test3.1100,456.jpg -> ('Test3', '1100,456', 'jpg')    (Basic with Large Number)
T.E.S.T.4.5,6.png  -> ('T.E.S.T.4', '5,6', 'png')     (Doesn't strip all periods)
Test5,7,8.sss      -> ('Test5,7,8', 'sss')            (No Match)
Test6.2,3,4.png    -> ('Test6.2,3,4', 'png')          (No Match, too many commas)
Test7.5,6.7,8.test -> ('Test7', '5,6', '7,8', 'test') (Double Match?)

The last one isn't too important; I would only expect .#,#. to appear once. I expect most of the files I'm processing to fall into the first through fourth examples, so those are the ones I'm most interested in.

Thanks for the help!

2
  • 4
    Awww man. If only everyone would provide such an extensive list of examples that match and examples that fail... Commented Sep 26, 2012 at 18:35
  • @m.buettner I know, this is beautiful in comparison to 99% of regex questions Commented Sep 26, 2012 at 18:39

6 Answers

4

You can use the regex \.\d+,\d+\. to find all matches for that pattern, but you will need to do a little extra to get the output you expect, especially since you want to treat .5,6.7,8. as two matches.
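To see why the extra work is needed, here is a small sketch that runs the plain pattern through re.findall on strings from the question. The dot between the two pairs is consumed by the first match, so it is no longer available to start a second one. The variable names are only for illustration.

```python
import re

pattern = re.compile(r'\.\d+,\d+\.')

# One match, returned with its surrounding dots.
single = pattern.findall('Test1.0,0.csv')

# The '.' after '5,6' is consumed by the first match, so '7,8' is never found.
double = pattern.findall('Test7.5,6.7,8.test')

# No '.#,#.' anywhere in the string.
missing = pattern.findall('Test2.wma')

print(single)   # ['.0,0.']
print(double)   # ['.5,6.']
print(missing)  # []
```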

Here is one potential solution:

import re

def transform(s):
    # Turn every '.' inside a '.#,#.' run into '\n', then split on '\n'.
    s = re.sub(r'(\.\d+,\d+)+\.', lambda m: m.group(0).replace('.', '\n'), s)
    return tuple(s.split('\n'))

Examples:

>>> transform('Test1.0,0.csv')
('Test1', '0,0', 'csv')
>>> transform('Test2.wma')
('Test2.wma',)
>>> transform('Test3.1100,456.jpg')
('Test3', '1100,456', 'jpg')
>>> transform('T.E.S.T.4.5,6.png')
('T.E.S.T.4', '5,6', 'png')
>>> transform('Test5,7,8.sss')
('Test5,7,8.sss',)
>>> transform('Test6.2,3,4.png')
('Test6.2,3,4.png',)
>>> transform('Test7.5,6.7,8.test')
('Test7', '5,6', '7,8', 'test')

To also get the file extension split off when there are no matches, you can use the following:

import re

def transform(s):
    s = re.sub(r'(\.\d+,\d+)+\.', lambda m: m.group(0).replace('.', '\n'), s)
    groups = s.split('\n')
    # Split the extension off the last piece (at most one split).
    groups[-1:] = groups[-1].rsplit('.', 1)
    return tuple(groups)

This gives the same output as above, except that 'Test2.wma' becomes ('Test2', 'wma'), with similar behavior for 'Test5,7,8.sss' and 'Test6.2,3,4.png'.
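As a sanity check, the second version can be run against all seven examples from the question. This is only a sketch: the `expected` dict is copied by hand from the question's table, and the function is repeated so the snippet runs on its own.

```python
import re

def transform(s):
    s = re.sub(r'(\.\d+,\d+)+\.', lambda m: m.group(0).replace('.', '\n'), s)
    groups = s.split('\n')
    groups[-1:] = groups[-1].rsplit('.', 1)
    return tuple(groups)

# Expected outputs, copied from the question.
expected = {
    'Test1.0,0.csv': ('Test1', '0,0', 'csv'),
    'Test2.wma': ('Test2', 'wma'),
    'Test3.1100,456.jpg': ('Test3', '1100,456', 'jpg'),
    'T.E.S.T.4.5,6.png': ('T.E.S.T.4', '5,6', 'png'),
    'Test5,7,8.sss': ('Test5,7,8', 'sss'),
    'Test6.2,3,4.png': ('Test6.2,3,4', 'png'),
    'Test7.5,6.7,8.test': ('Test7', '5,6', '7,8', 'test'),
}

for name, want in expected.items():
    got = transform(name)
    print(name, '->', got, 'OK' if got == want else 'MISMATCH')
```

Every line should end in OK, including the two "too many commas" cases.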


5 Comments

Also if the last group contains more than one . you will end up splitting the last group several times.
Just modified it to use \n instead of a space, you could also use something like \x00 to be more sure it won't be included in a valid string.
transform('.a.a.a.a.a.a.') == ('', 'a', 'a', 'a', 'a', 'a', 'a', '')
@nneonneo ahh I see, I forgot the count argument to rsplit, thanks.
Thanks for the example! I really need to take the time to learn more about regex, it's so powerful.
3

To allow for multiple consecutive matches, use lookahead/lookbehind:

r'(?<=\.)\d+,\d+(?=\.)'

Example:

>>> re.findall(r'(?<=\.)\d+,\d+(?=\.)', 'Test7.5,6.7,8.test')
['5,6', '7,8']

We can also use lookahead to perform the split as you want it:

import re
def split_it(s):
    pieces = re.split(r'\.(?=\d+,\d+\.)', s)
    pieces[-1:] = pieces[-1].rsplit('.', 1) # split off extension
    return pieces

Testing:

>>> print(split_it('Test1.0,0.csv'))
['Test1', '0,0', 'csv']
>>> print(split_it('Test2.wma'))
['Test2', 'wma']
>>> print(split_it('Test3.1100,456.jpg'))
['Test3', '1100,456', 'jpg']
>>> print(split_it('T.E.S.T.4.5,6.png'))
['T.E.S.T.4', '5,6', 'png']
>>> print(split_it('Test5,7,8.sss'))
['Test5,7,8', 'sss']
>>> print(split_it('Test6.2,3,4.png'))
['Test6.2,3,4', 'png']
>>> print(split_it('Test7.5,6.7,8.test'))
['Test7', '5,6', '7,8', 'test']

Comments

0
'/^(.+)\.((\d+,\d+)\.)?(.+)$/'

The third capturing group should contain the pair of numbers. If you have multiple of those pairs, you should get multiple matches, and the third capturing group would always contain the pair.
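As a hedged check, here is that pattern tried with Python's re module. This assumes the slash delimiters belong to another regex flavor, so they are dropped. In Python at least, the greedy first (.+) swallows the number pair and the optional group is left unmatched, so the pattern would need adjusting before it gives the output asked for.

```python
import re

# The pattern from above without the '/' delimiters (Python does not use them).
pattern = re.compile(r'^(.+)\.((\d+,\d+)\.)?(.+)$')

m = pattern.match('Test1.0,0.csv')

# Groups 2 and 3 are None: the greedy first group already took '0,0'.
print(m.groups())  # ('Test1.0,0', None, None, 'csv')
```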

Comments

0
^(.*?)\.(\d+,\d+)\.(.*?)$

This passes your tests, at least in Patterns:

[Screenshot: passing tests in Patterns]
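For comparison, a small sketch of what the same pattern returns under Python's re module. The helper name `try_match` is made up for this example. The basic cases split as expected, a string without .#,#. gives no match at all (so the extension would have to be split off separately), and the double-match case keeps the second pair in the last group.

```python
import re

pattern = re.compile(r'^(.*?)\.(\d+,\d+)\.(.*?)$')

def try_match(s):
    """Return the captured groups, or None when the pattern does not match."""
    m = pattern.match(s)
    return m.groups() if m else None

for name in ['Test1.0,0.csv', 'Test2.wma', 'T.E.S.T.4.5,6.png', 'Test7.5,6.7,8.test']:
    print(name, '->', try_match(name))
```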

Comments

0

This is pretty close. Does Python support named groups?

^.*(?P<group1>\d+(?:,\d+)?)\.(?P<group2>\d+(?:,\d+)?).*\..+$

1 Comment

The named group syntax is (?P<name>pattern)
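A minimal sketch of that syntax applied to the single-match case from the question. The group names `stem`, `pair` and `ext` are made up for this example, and the pattern is a simplified one rather than the pattern in the answer above.

```python
import re

# Group names 'stem', 'pair' and 'ext' are hypothetical, chosen for illustration.
pattern = re.compile(r'^(?P<stem>.*?)\.(?P<pair>\d+,\d+)\.(?P<ext>[^.]+)$')

m = pattern.match('Test3.1100,456.jpg')

print(m.group('pair'))  # 1100,456
print(m.groupdict())    # {'stem': 'Test3', 'pair': '1100,456', 'ext': 'jpg'}

# Strings without '.#,#.' do not match, so check for None before using groups.
no_match = pattern.match('Test2.wma')
print(no_match)         # None
```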
0

Use regex pattern ^([^,]+)\.(\d+,\d+)\.([^,.]+)$

Check this demo >>

>>> print(re.findall(r'^([^,]+)\.(\d+,\d+)\.([^,.]+)$', 'Test1.0,0.csv'))
[('Test1', '0,0', 'csv')]

>>> print(re.findall(r'^([^,]+)\.(\d+,\d+)\.([^,.]+)$', 'Test2.wma'))
[]

>>> print(re.findall(r'^([^,]+)\.(\d+,\d+)\.([^,.]+)$', 'Test3.1100,456.jpg'))
[('Test3', '1100,456', 'jpg')]

>>> print(re.findall(r'^([^,]+)\.(\d+,\d+)\.([^,.]+)$', 'T.E.S.T.4.5,6.png'))
[('T.E.S.T.4', '5,6', 'png')]

>>> print(re.findall(r'^([^,]+)\.(\d+,\d+)\.([^,.]+)$', 'Test5,7,8.sss'))
[]

>>> print(re.findall(r'^([^,]+)\.(\d+,\d+)\.([^,.]+)$', 'Test6.2,3,4.png'))
[]

>>> print(re.findall(r'^([^,]+)\.(\d+,\d+)\.([^,.]+)$', 'Test7.5,6.7,8.test'))
[]

1 Comment

What does this produce for: Test.xx,yz.csv?
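Running it is the quickest way to find out. In this sketch the result is an empty list, since \d+ needs digits right after a dot and 'xx' and 'yz' are letters.

```python
import re

pattern = re.compile(r'^([^,]+)\.(\d+,\d+)\.([^,.]+)$')

# Neither '.' in the string is followed by digits, so nothing matches.
result = pattern.findall('Test.xx,yz.csv')
print(result)  # []
```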
