I'm trying to use Python and regex to get the last set of integers in a filename (string). The method below does what I need, but I also want to return the inverse, i.e. the remaining parts of the string that the regex did not match. How can I do that?

Here is the regex: ([0-9]+|#+)(?!.*([0-9]+|#+))

import re

values = [
    'image.0001',
    'image###',
    '###image###',
    'image001',
    'image_001',
    '001',
    '0001.image',
    '001image',
    '001_image',
    'image',
    '01_image01',
    '03_image01',
]

pattern = '([0-9]+|#+|@+)'
regex = '{0}(?!.*{0})'.format(pattern)

for v in values:
    result = re.search(regex, v)
    if result:
        print(result.groups())

Currently it returns something like ('01', None). I'd like it to return something like ('image', '0001').

Updated

Optionally, is there a way to split the strings by groups of numbers? For example:

'image.0001' > ['image.', '0001']
'image###' > ['image', '###']
'###image###' > ['###', 'image', '###']
'image001' > ['image', '001']
'image_001' > ['image_', '001']
'001' > ['001']
'0001.image' > ['0001', '.image']
'001image' > ['001', 'image']
'001_image' > ['001', '_image']
'image' > ['image']
'01_image01' > ['01', '_image', '01']
'03_image01' > ['03', '_image', '01']
  • Have you tried with re.findall(...) ? See docs.python.org/3/library/re.html#re.findall Commented Jan 7, 2021 at 21:14
  • What are the expected outputs for 0001.image, 001image, 001_image and image? Commented Jan 7, 2021 at 21:15
  • That's a good question. Is there a way for me to return a dict that holds the known parts, like prefix = all other bits and num = the last digit occurrence? Commented Jan 7, 2021 at 21:17
  • Check below. You just need to sub out all non-letters and all non-numbers to get either one or the other. In my answer I followed your "last digit" requirement. Commented Jan 7, 2021 at 21:18

2 Answers

1

EDIT:

Use

re.findall(r'\d+|#+|@+|[^#@\d]+', v)

Explanation

--------------------------------------------------------------------------------
  \d+                      digits (0-9) (1 or more times (matching
                           the most amount possible))
--------------------------------------------------------------------------------
 |                        OR
--------------------------------------------------------------------------------
  #+                       '#' (1 or more times (matching the most
                           amount possible))
--------------------------------------------------------------------------------
 |                        OR
--------------------------------------------------------------------------------
  @+                       '@' (1 or more times (matching the most
                           amount possible))
--------------------------------------------------------------------------------
 |                        OR
--------------------------------------------------------------------------------
  [^#@\d]+                 any character except: '#', '@', digits
                           (0-9) (1 or more times (matching the
                           most amount possible))
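
As a quick check, the pattern above can be run over a few of the sample filenames from the question. This is only a sketch: the names TOKEN_RE and split_runs are made up for illustration, and it assumes digits, '#' and '@' are the only run types that matter.

```python
import re

# Each alternative matches one kind of run; the last one matches
# any run of characters that are not digits, '#' or '@'.
TOKEN_RE = re.compile(r'\d+|#+|@+|[^#@\d]+')


def split_runs(name):
    """Split name into consecutive runs of digits, '#', '@', or other characters."""
    return TOKEN_RE.findall(name)


for v in ['image.0001', '###image###', '001', 'image', '01_image01']:
    print(v, '>', split_runs(v))
```

Because the four alternatives together cover every possible character, joining the pieces gives back the original string, so nothing is lost by the split.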

ORIGINAL: Use re.split, and add a capturing group so the matched part is kept in the result:

import re

values = [
    'image.0001',
    'image###',
    '###image###',
    'image001',
    'image_001',
    '001',
    '0001.image',
    '001image',
    '001_image',
    'image',
    '01_image01',
    '03_image01',
]

pattern = '[0-9]+|#+|@+'
regex = re.compile(r'({0})(?!.*(?:{0}))'.format(pattern))
for v in values:
    print(regex.split(v))

Results:

['image.', '0001', '']
['image', '###', '']
['###image', '###', '']
['image', '001', '']
['image_', '001', '']
['', '001', '']
['', '0001', '.image']
['', '001', 'image']
['', '001', '_image']
['image']
['01_image', '01', '']
['03_image', '01', '']
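
The empty strings in these results show up whenever the matched run sits at the very start or end of the string. A minimal sketch of dropping them, reusing the same pattern (the function name split_on_last_run is hypothetical, not part of the re module):

```python
import re

pattern = '[0-9]+|#+|@+'
# The capturing group keeps the matched run in the split result; the
# lookahead rejects any run that is followed by another run later on.
regex = re.compile(r'({0})(?!.*(?:{0}))'.format(pattern))


def split_on_last_run(name):
    """Split around the last digit/'#'/'@' run and drop empty pieces."""
    return [part for part in regex.split(name) if part]


for v in ['image.0001', '001', '0001.image', 'image', '01_image01']:
    print(split_on_last_run(v))
```

Note that this still splits only around the last run, so '01_image01' keeps '01_image' as a single piece.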

Explanation

--------------------------------------------------------------------------------
  (                        group and capture to \1:
--------------------------------------------------------------------------------
    [0-9]+                   any character of: '0' to '9' (1 or more
                             times (matching the most amount
                             possible))
--------------------------------------------------------------------------------
   |                        OR
--------------------------------------------------------------------------------
    #+                       '#' (1 or more times (matching the most
                             amount possible))
--------------------------------------------------------------------------------
   |                        OR
--------------------------------------------------------------------------------
    @+                       '@' (1 or more times (matching the most
                             amount possible))
--------------------------------------------------------------------------------
  )                        end of \1
--------------------------------------------------------------------------------
  (?!                      look ahead to see if there is not:
--------------------------------------------------------------------------------
    .*                       any character except \n (0 or more times
                             (matching the most amount possible))
--------------------------------------------------------------------------------
    (?:                      group, but do not capture:
--------------------------------------------------------------------------------
      [0-9]+                   any character of: '0' to '9' (1 or
                               more times (matching the most amount
                               possible))
--------------------------------------------------------------------------------
     |                        OR
--------------------------------------------------------------------------------
      #+                       '#' (1 or more times (matching the
                               most amount possible))
--------------------------------------------------------------------------------
     |                        OR
--------------------------------------------------------------------------------
      @+                       '@' (1 or more times (matching the
                               most amount possible))
--------------------------------------------------------------------------------
    )                        end of grouping
--------------------------------------------------------------------------------
  )                        end of look-ahead
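
If a labelled result is more convenient than a list (one of the comments on the question asks for a dict with a prefix and the last number), named groups are one option. The sketch below is only an assumption about what such a helper could look like: the function name split_last_run and the keys prefix, num and suffix are made up for illustration. It uses a different technique from the lookahead above: the suffix group simply refuses to contain another run.

```python
import re

# prefix: as few characters as possible
# num:    one run of digits, '#' or '@'
# suffix: trailing characters that contain no digit, '#' or '@'
LAST_RUN_RE = re.compile(
    r'(?P<prefix>.*?)(?P<num>[0-9]+|#+|@+)(?P<suffix>[^0-9#@]*)'
)


def split_last_run(name):
    """Return a dict with prefix/num/suffix, or None if name has no run."""
    m = LAST_RUN_RE.fullmatch(name)
    return m.groupdict() if m else None


print(split_last_run('image.0001'))
print(split_last_run('0001.image'))
print(split_last_run('image'))
```

Since the suffix cannot contain a digit, '#' or '@', the num group is forced onto the last run in the string; strings with no run at all give None.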

5 Comments

Is there a way to simply split the string by consecutive numbers? That may work better, to avoid the stray empty parts of the list.
@JokerMartini Not sure what you mean. Removing empty items is easy: list(filter(None, result)).
I've updated the question above to show what I mean. Your solution is on the right track for what I'm doing, but now that I see it in action, I think the updated question describes a better solution.
How do I modify it to split not just numbers and words but also #+|@+, like I have above in my code?
@JokerMartini re.findall(r'\d+|#+|@+|[^#@\d]+', v)
0
import re

values = [
    'image.0001',
    'image###',
    '###image###',
    'image001',
    'image_001',
    '001',
    '0001.image',
    '001image',
    '001_image',
    'image',
    '01_image01',
    '03_image01',
]

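for v in values:
    # Keep only the ASCII letters (drops digits, '#', '_', '.', etc.).
    print(re.sub(r"[^A-Za-z]+", "", v), end=" ")
    # Remove an optional prefix ending in '_' or '.', then a run of non-digits.
    print(re.sub(r"(.+[_.]){0,1}[^0-9]+", "", v))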

Output:

image 0001
image 
image 
image 001
image 001
 001
image 
image 001
image 
image 
image 01
image 01

3 Comments

I need the last occurrence, so '0001.image' should still return a number.
So you don't want "the last set of integers in a filename"?
I want the last occurrence. I'll update the question.
