Python/Regex - Match .#,#. in String

Question

What regex can I use to match ".#,#." within a string? It may or may not exist in the string. Some examples with expected outputs:

Test1.0,0.csv      -> ('Test1', '0,0', 'csv')         (Basic Example)
Test2.wma          -> ('Test2', 'wma')                (No Match)
Test3.1100,456.jpg -> ('Test3', '1100,456', 'jpg')    (Basic with Large Number)
T.E.S.T.4.5,6.png  -> ('T.E.S.T.4', '5,6', 'png')     (Doesn't strip all periods)
Test5,7,8.sss      -> ('Test5,7,8', 'sss')            (No Match)
Test6.2,3,4.png    -> ('Test6.2,3,4', 'png')          (No Match, too many commas)
Test7.5,6.7,8.test -> ('Test7', '5,6', '7,8', 'test') (Double Match?)

The last one isn't too important and I would only expect that .#,#. would appear once. Most files I'm processing, I would expect to fall into the first through fourth examples, so I'm most interested in those.

Thanks for the help!

Awww man. If only everyone would provide such an extensive list of examples that match and examples that fail...
– Martin Ender, Commented Sep 26, 2012 at 18:35
@m.buettner I know, this is beautiful in comparison to 99% of regex questions
– JKirchartz, Commented Sep 26, 2012 at 18:39

Andrew Clark · Accepted Answer · 2012-09-26 18:54:00Z

4

You can use the regex \.\d+,\d+\. to find all matches for that pattern, but you will need to do a little extra to get the output you expect, especially since you want to treat .5,6.7,8. as two matches.
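To see why a plain findall with that pattern is not quite enough, here is a small check (a sketch; the sample strings come from the question, and the name PAIR is only for illustration). Matches cannot overlap, so the dot that closes the first pair is consumed and cannot open the second one.

```python
import re

# Illustrative name; the pattern itself is the one quoted above.
PAIR = re.compile(r'\.\d+,\d+\.')

single = PAIR.findall('Test1.0,0.csv')
double = PAIR.findall('Test7.5,6.7,8.test')

print(single)  # ['.0,0.']
print(double)  # ['.5,6.'] (the second pair is missed)
```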

Here is one potential solution:

import re

def transform(s):
    # Replace the dots around each run of number pairs with newlines, then split on them.
    s = re.sub(r'(\.\d+,\d+)+\.', lambda m: m.group(0).replace('.', '\n'), s)
    return tuple(s.split('\n'))

Examples:

>>> transform('Test1.0,0.csv')
('Test1', '0,0', 'csv')
>>> transform('Test2.wma')
('Test2.wma',)
>>> transform('Test3.1100,456.jpg')
('Test3', '1100,456', 'jpg')
>>> transform('T.E.S.T.4.5,6.png')
('T.E.S.T.4', '5,6', 'png')
>>> transform('Test5,7,8.sss')
('Test5,7,8.sss',)
>>> transform('Test6.2,3,4.png')
('Test6.2,3,4.png',)
>>> transform('Test7.5,6.7,8.test')
('Test7', '5,6', '7,8', 'test')

To also get the file extension split off when there are no matches, you can use the following:

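import re

def transform(s):
    # Replace the dots around each run of number pairs with newlines, then split on them.
    s = re.sub(r'(\.\d+,\d+)+\.', lambda m: m.group(0).replace('.', '\n'), s)
    groups = s.split('\n')
    # Split the extension off the last piece (one split at most).
    groups[-1:] = groups[-1].rsplit('.', 1)
    return tuple(groups)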

This produces the same output as above, except that 'Test2.wma' becomes ('Test2', 'wma'), and likewise 'Test5,7,8.sss' becomes ('Test5,7,8', 'sss') and 'Test6.2,3,4.png' becomes ('Test6.2,3,4', 'png').
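As a quick sanity check, the extension-splitting version can be run against the table from the question. This is only a sketch: the EXPECTED dictionary is an illustrative name, and its values are copied from the question's expected outputs.

```python
import re

def transform(s):
    # Replace the dots around each run of number pairs with newlines, then split.
    s = re.sub(r'(\.\d+,\d+)+\.', lambda m: m.group(0).replace('.', '\n'), s)
    groups = s.split('\n')
    # Split the extension off the last piece (one split at most).
    groups[-1:] = groups[-1].rsplit('.', 1)
    return tuple(groups)

# Expected results copied from the question (illustrative container).
EXPECTED = {
    'Test1.0,0.csv': ('Test1', '0,0', 'csv'),
    'Test2.wma': ('Test2', 'wma'),
    'Test3.1100,456.jpg': ('Test3', '1100,456', 'jpg'),
    'T.E.S.T.4.5,6.png': ('T.E.S.T.4', '5,6', 'png'),
    'Test5,7,8.sss': ('Test5,7,8', 'sss'),
    'Test6.2,3,4.png': ('Test6.2,3,4', 'png'),
    'Test7.5,6.7,8.test': ('Test7', '5,6', '7,8', 'test'),
}

for name, want in EXPECTED.items():
    got = transform(name)
    assert got == want, (name, got)
```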

edited Sep 26, 2012 at 18:54

answered Sep 26, 2012 at 18:41

Andrew Clark


5 Comments

nneonneo Over a year ago

Also if the last group contains more than one . you will end up splitting the last group several times.

Andrew Clark Over a year ago

Just modified it to use \n instead of a space, you could also use something like \x00 to be more sure it won't be included in a valid string.
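A minimal sketch of the \x00 idea from this comment, assuming a NUL character never occurs in the names being processed. The names SEP and transform_nul are made up for this example.

```python
import re

SEP = '\x00'  # assumption: NUL never appears in a file name

def transform_nul(s):
    # Same approach as transform(), but with NUL as the temporary separator.
    s = re.sub(r'(\.\d+,\d+)+\.', lambda m: m.group(0).replace('.', SEP), s)
    groups = s.split(SEP)
    groups[-1:] = groups[-1].rsplit('.', 1)
    return tuple(groups)

# A name that contains a newline is no longer split at that newline.
odd = transform_nul('odd\nname.1,2.txt')
print(odd)  # ('odd\nname', '1,2', 'txt')
```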

nneonneo Over a year ago

transform('.a.a.a.a.a.a.') == ('', 'a', 'a', 'a', 'a', 'a', 'a', '')

Andrew Clark Over a year ago

@nneonneo ahh I see, I forgot the count argument to rsplit, thanks.
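For reference, the difference the count argument makes, shown on a made-up file name:

```python
name = 'archive.tar.gz'  # hypothetical example name

no_count = name.rsplit('.')       # splits at every dot
with_count = name.rsplit('.', 1)  # splits only at the last dot

print(no_count)    # ['archive', 'tar', 'gz']
print(with_count)  # ['archive.tar', 'gz']
```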

Scott B Over a year ago

Thanks for the example! I really need to take the time to learn more about regex, it's so powerful.

nneonneo · Accepted Answer · 2012-09-26 18:52:18Z

To allow for multiple consecutive matches, use lookahead/lookbehind:

r'(?<=\.)\d+,\d+(?=\.)'

Example:

>>> re.findall(r'(?<=\.)\d+,\d+(?=\.)', 'Test7.5,6.7,8.test')
['5,6', '7,8']

We can also use lookahead to perform the split as you want it:

import re
def split_it(s):
    pieces = re.split(r'\.(?=\d+,\d+\.)', s)
    pieces[-1:] = pieces[-1].rsplit('.', 1) # split off extension
    return pieces

Testing:

>>> print(split_it('Test1.0,0.csv'))
['Test1', '0,0', 'csv']
>>> print(split_it('Test2.wma'))
['Test2', 'wma']
>>> print(split_it('Test3.1100,456.jpg'))
['Test3', '1100,456', 'jpg']
>>> print(split_it('T.E.S.T.4.5,6.png'))
['T.E.S.T.4', '5,6', 'png']
>>> print(split_it('Test5,7,8.sss'))
['Test5,7,8', 'sss']
>>> print(split_it('Test6.2,3,4.png'))
['Test6.2,3,4', 'png']
>>> print(split_it('Test7.5,6.7,8.test'))
['Test7', '5,6', '7,8', 'test']

Martin Ender · Accepted Answer · 2012-09-26 18:40:37Z

0

'/^(.+)\.((\d+,\d+)\.)?(.+)$/'

The third capturing group should contain the pair of numbers. If you have several of those pairs, you should get multiple matches, and the third capturing group would always contain the pair.
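A quick way to try this in Python, offered as a sketch rather than a verdict on the pattern: Python's re module takes the bare pattern, so the surrounding slashes are dropped here (that adaptation is an assumption about the intent). In this run, the greedy first group takes the number pair and the optional group is left empty.

```python
import re

# The pattern from above, without the /.../ delimiters.
PATTERN = re.compile(r'^(.+)\.((\d+,\d+)\.)?(.+)$')

m = PATTERN.match('Test1.0,0.csv')
groups = m.groups()
print(groups)  # ('Test1.0,0', None, None, 'csv')
```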

answered Sep 26, 2012 at 18:40

Martin Ender


David Eyk · Accepted Answer · 2012-09-26 18:41:14Z

0

^(.*?)\.(\d+,\d+)\.(.*?)$

This passes your tests, at least in Patterns:

[Screenshot: passing tests in Patterns]
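The same pattern can be tried directly in Python. This is a sketch: the helper name parts is made up, and the inputs come from the question. Note that a name without a number pair gives no match at all, and a double pair leaves the second pair attached to the extension.

```python
import re

PATTERN = re.compile(r'^(.*?)\.(\d+,\d+)\.(.*?)$')

def parts(name):
    # Hypothetical helper: return the captured groups, or None when nothing matches.
    m = PATTERN.match(name)
    return m.groups() if m else None

print(parts('Test1.0,0.csv'))       # ('Test1', '0,0', 'csv')
print(parts('T.E.S.T.4.5,6.png'))   # ('T.E.S.T.4', '5,6', 'png')
print(parts('Test2.wma'))           # None
print(parts('Test7.5,6.7,8.test'))  # ('Test7', '5,6', '7,8.test')
```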

answered Sep 26, 2012 at 18:41

David Eyk


CaffGeek · Accepted Answer · 2012-09-26 18:44:19Z

0

This is pretty close. Does Python support named groups?

^.*(?P<group1>\d+(?:,\d+)?)\.(?P<group2>\d+(?:,\d+)?).*\..+$

answered Sep 26, 2012 at 18:44

CaffGeek


1 Comment

David Eyk Over a year ago

The named group syntax is (?P<name>pattern)
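A small illustration of that syntax, using a simpler pattern than the one in the answer. The group names first and second are chosen only for this example.

```python
import re

# Hypothetical group names; the input string is taken from the question.
PAIR = re.compile(r'\.(?P<first>\d+),(?P<second>\d+)\.')

m = PAIR.search('Test3.1100,456.jpg')
first, second = m.group('first'), m.group('second')
named = m.groupdict()

print(first, second)  # 1100 456
print(named)          # {'first': '1100', 'second': '456'}
```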

Ωmega · Accepted Answer · 2012-09-26 19:20:33Z

0

Use the regex pattern ^([^,]+)\.(\d+,\d+)\.([^,.]+)$

Demo:

>>> print(re.findall(r'^([^,]+)\.(\d+,\d+)\.([^,.]+)$', 'Test1.0,0.csv'))
[('Test1', '0,0', 'csv')]

>>> print(re.findall(r'^([^,]+)\.(\d+,\d+)\.([^,.]+)$', 'Test2.wma'))
[]

>>> print(re.findall(r'^([^,]+)\.(\d+,\d+)\.([^,.]+)$', 'Test3.1100,456.jpg'))
[('Test3', '1100,456', 'jpg')]

>>> print(re.findall(r'^([^,]+)\.(\d+,\d+)\.([^,.]+)$', 'T.E.S.T.4.5,6.png'))
[('T.E.S.T.4', '5,6', 'png')]

>>> print(re.findall(r'^([^,]+)\.(\d+,\d+)\.([^,.]+)$', 'Test5,7,8.sss'))
[]

>>> print(re.findall(r'^([^,]+)\.(\d+,\d+)\.([^,.]+)$', 'Test6.2,3,4.png'))
[]

>>> print(re.findall(r'^([^,]+)\.(\d+,\d+)\.([^,.]+)$', 'Test7.5,6.7,8.test'))
[]
[]

edited Sep 26, 2012 at 19:20

answered Sep 26, 2012 at 18:38

Ωmega


1 Comment

Dave Over a year ago

What does this produce for: Test.xx,yz.csv?
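This can be checked directly. Since \d only matches digits, the letters between the dots should not match, and findall returns an empty list in this run (the variable names are illustrative):

```python
import re

PATTERN = r'^([^,]+)\.(\d+,\d+)\.([^,.]+)$'

result = re.findall(PATTERN, 'Test.xx,yz.csv')
print(result)  # []
```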
