1

I need to search a string in Python 3, and I'm having trouble implementing non-greedy matching that starts from the end of the string.

Let me explain with an example.

The input can be one of the following:

test1 = 'AB_x-y-z_XX1234567890_84481.xml' 
test2 = 'x-y-z_XX1234567890_84481.xml'
test3 = 'XX1234567890_84481.xml'

I need to find the last part of the string, which has the form:

somestring_otherstring.xml

In all of the above cases, the regex should return XX1234567890_84481.xml.

My best try is:

import re

result = re.search(r'(_.+)?\.xml$', test1, re.I).group()
print(result)

Here I used:

(_.+)? to match "_anystring" in non-greedy mode

\.xml$ to match ".xml" in the final part of the string

The output I get is not correct:

_x-y-z_XX1234567890_84481.xml

I found some SO questions (link) explaining that a regex search always starts from the left, even with a non-greedy quantifier.

Could anyone explain to me how to implement a non-greedy regex from the right?

3
  • Why should it match _XX1234567890_84481.xml and not _84481.xml? Is it everything after the second-to-last underscore? Commented Mar 5, 2019 at 16:47
  • It shouldn't match _XX1234567890_84481.xml or _84481.xml, only XX1234567890_84481.xml. Commented Mar 5, 2019 at 17:01
  • I might have used the wrong strings, but to be clear: why should it match XX1234567890_84481.xml and not 84481.xml? Commented Mar 5, 2019 at 18:13

3 Answers

1

Your pattern (_.+)?\.xml$ uses an optional capturing group that matches from the first underscore up to the point where .xml can match at the end of the string. It does not take into account how many underscores should appear in between.

To match only the last part, you can omit the capturing group. You could use a negated character class and the anchor $ to assert the end of the string, since this is the last part:

[^_]+_[^_]+\.xml$

The pattern breaks down as follows:

  • [^_]+ Match one or more characters other than _
  • _ Match a literal underscore
  • [^_]+ Match one or more characters other than _
  • \.xml$ Match .xml at the end of the string

For example:

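import re

test1 = 'AB_x-y-z_XX1234567890_84481.xml'
# Raw string so that \. reaches the regex engine unchanged
result = re.search(r'[^_]+_[^_]+\.xml$', test1, re.I)
# re.search returns None when nothing matches, so check before calling group()
if result:
    print(result.group())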

1

Not sure if this matches what you're looking for conceptually as "non greedy from the right" - but this pattern yields the correct answer:

'[^_]+_[^_]+\.xml$'

The [^_] is a character class matching any character which is not an underscore.
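
As a minimal sketch, this is how the pattern behaves on the three inputs from the question. The helper name last_two_parts is made up for illustration, and re.I is carried over from the question's own attempt:

```python
import re

# Two underscore-free chunks joined by one underscore, then ".xml" at the end
PATTERN = re.compile(r'[^_]+_[^_]+\.xml$', re.I)


def last_two_parts(name):
    """Return the trailing 'part_part.xml' of name, or None (hypothetical helper)."""
    match = PATTERN.search(name)
    return match.group() if match else None


tests = [
    'AB_x-y-z_XX1234567890_84481.xml',
    'x-y-z_XX1234567890_84481.xml',
    'XX1234567890_84481.xml',
]
for name in tests:
    print(last_two_parts(name))
```

Each of the three inputs should print XX1234567890_84481.xml. A name without any underscore yields None, because the pattern requires one.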

1

You can use this regex to match what you want:

[^_]*_[^_]*\.xml

Check out this Python code:

import re

arr = ['AB_x-y-z_XX1234567890_84481.xml', 'x-y-z_XX1234567890_84481.xml', 'XX1234567890_84481.xml']

for s in arr:
    m = re.search(r'[^_]*_[^_]*\.xml', s)
    if m:
        print(m.group(0))

This prints:

XX1234567890_84481.xml
XX1234567890_84481.xml
XX1234567890_84481.xml
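
Note that this pattern has no $ anchor, so it returns the first place in the string where it fits. That is fine for the inputs above, but it may matter for other file names. A small sketch, assuming a hypothetical name (not from the question) that contains .xml twice:

```python
import re

# Hypothetical file name that contains ".xml" twice; not from the question
name = 'a_b.xml_c_d.xml'

# Without $, the leftmost possible match wins
unanchored = re.search(r'[^_]*_[^_]*\.xml', name).group()
# With $, the match is forced to end at the end of the string
anchored = re.search(r'[^_]*_[^_]*\.xml$', name).group()

print(unanchored)  # a_b.xml
print(anchored)    # c_d.xml
```

If the match must always be the tail of the string, adding $ is probably the safer choice.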

The problem with your regex (_.+)?\.xml$ is that the (_.+)? part starts matching at the first _ and consumes everything up to the literal .xml. The trailing ? only makes the whole group optional; it does not make it non-greedy. As a result, for the input AB_x-y-z_XX1234567890_84481.xml the group matches _x-y-z_XX1234567890_84481, which is not the behavior you wanted.
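
Switching to a lazy quantifier does not help here either, because the engine still reports the leftmost position at which the pattern can match. A small sketch comparing the two; the lazy variant (_.+?)? is added for illustration and is not from the question:

```python
import re

test1 = 'AB_x-y-z_XX1234567890_84481.xml'

# Original pattern: greedy .+ inside an optional group
greedy = re.search(r'(_.+)?\.xml$', test1, re.I).group()
# Lazy .+? inside the same optional group (illustrative variant)
lazy = re.search(r'(_.+?)?\.xml$', test1, re.I).group()

print(greedy)  # _x-y-z_XX1234567890_84481.xml
print(lazy)    # _x-y-z_XX1234567890_84481.xml
```

Both searches start at the first underscore and have to stretch to the final .xml to satisfy $, so laziness changes nothing. Excluding underscores with [^_] is what limits how far to the left the match can begin.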
