non greedy Python regex from end of string

Question

I need to search a string in Python 3, and I'm having trouble implementing non-greedy logic that starts from the end.

Let me explain with an example.

The input can be one of the following:

test1 = 'AB_x-y-z_XX1234567890_84481.xml' 
test2 = 'x-y-z_XX1234567890_84481.xml'
test3 = 'XX1234567890_84481.xml'

I need to find the last part of the string ending with

somestring_otherstring.xml

In all the above cases the regex should return XX1234567890_84481.xml

My best try is:

import re
result = re.search(r'(_.+)?\.xml$', test1, re.I).group()
print(result)

Here I used:

(_.+)? to match "_anystring" in non-greedy mode

\.xml$ to match ".xml" in the final part of the string

The output I get is not correct:

_x-y-z_XX1234567890_84481.xml

I found some SO questions (link) explaining that the regex engine starts from the left even with a non-greedy qualifier.

Could anyone explain to me how to implement a non-greedy regex from the right?

Why should it match _XX1234567890_84481.xml and not _84481.xml? Is it everything after the second-to-last underscore?
– BlueSheepToken, Commented Mar 5, 2019 at 16:47
It should match neither _XX1234567890_84481.xml nor _84481.xml, only XX1234567890_84481.xml.
– manuel_b, Commented Mar 5, 2019 at 17:01
I might have used the wrong strings, but I left them in for clarity. Why should it match XX1234567890_84481.xml and not 84481.xml?
– BlueSheepToken, Commented Mar 5, 2019 at 18:13

The fourth bird · Accepted Answer · 2019-03-05 17:23:53Z

1

Your pattern (_.+)?\.xml$ captures, in an optional group, everything from the first underscore up to the .xml at the end of the string. It does not take into account how many underscores should appear in between.
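As a small sketch of why a lazy quantifier alone does not help here: re.search reports the leftmost position where a match can start, and laziness only affects how far the match extends from that position. The lazy variant (_.+?)? below is added for illustration and is not part of the original question.

```python
import re

test1 = 'AB_x-y-z_XX1234567890_84481.xml'

# Greedy version from the question.
greedy = re.search(r'(_.+)?\.xml$', test1)

# Lazy variant (illustrative): .+? still has to expand until \.xml$ can match,
# and the match still starts at the first underscore.
lazy = re.search(r'(_.+?)?\.xml$', test1)

print(greedy.group())  # _x-y-z_XX1234567890_84481.xml
print(lazy.group())    # _x-y-z_XX1234567890_84481.xml
```

Both calls return the same text, because the $ anchor forces the match to run to the end of the string from the first underscore.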

To match only the last part, you can omit the capturing group. You could use a negated character class, [^_], together with the anchor $ to assert the end of the string, since the part you want is the last one:

[^_]+_[^_]+\.xml$

That will match

[^_]+ Match 1+ times not _
_ Match literally
[^_]+ Match 1+ times not _
\.xml$ Match .xml at the end of the string

For example:

import re

test1 = 'AB_x-y-z_XX1234567890_84481.xml'
result = re.search(r'[^_]+_[^_]+\.xml$', test1, re.I)
if result:
    print(result.group())

edited Mar 5, 2019 at 17:23

answered Mar 5, 2019 at 16:50

knap · Accepted Answer · 2019-03-05 16:53:24Z

1

Not sure if this matches what you're looking for conceptually as "non greedy from the right" - but this pattern yields the correct answer:

'[^_]+_[^_]+\.xml$'

The [^_] is a character class matching any character which is not an underscore.
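To illustrate what the negated class does, here is a minimal sketch using one of the question's inputs. re.findall is used only to show the individual runs and is not needed for the actual solution.

```python
import re

test1 = 'AB_x-y-z_XX1234567890_84481.xml'

# [^_]+ matches each maximal run of characters that are not underscores.
pieces = re.findall(r'[^_]+', test1)
print(pieces)  # ['AB', 'x-y-z', 'XX1234567890', '84481.xml']

# Two such runs joined by one underscore, anchored at the end of the string,
# keep only the last two pieces.
match = re.search(r'[^_]+_[^_]+\.xml$', test1)
print(match.group())  # XX1234567890_84481.xml
```

Because neither run can contain an underscore, the match cannot reach back past the second-to-last underscore.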

answered Mar 5, 2019 at 16:53

Pushpesh Kumar Rajwanshi · Accepted Answer · 2019-03-05 16:55:01Z

1

You can use this regex to capture what you want:

[^_]*_[^_]*\.xml

Check out this Python code,

import re

arr = ['AB_x-y-z_XX1234567890_84481.xml','x-y-z_XX1234567890_84481.xml','XX1234567890_84481.xml']

for s in arr:
    m = re.search(r'[^_]*_[^_]*\.xml', s)
    if m:
        print(m.group(0))

Prints,

XX1234567890_84481.xml
XX1234567890_84481.xml
XX1234567890_84481.xml

The problem with your regex (_.+)?\.xml$ is that the (_.+)? part starts matching at the first _ and matches everything until it reaches the literal .xml. The trailing ? only makes the whole group optional. As a result, for a string containing _x-y-z_XX1234567890_84481.xml, the group also matches _x-y-z_XX1234567890_84481, which is not the behavior you wanted.
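One caveat worth checking: the pattern in this answer has no $ anchor, so it returns the first place where two underscore-free runs are followed by .xml. The file name below is hypothetical and chosen only to show the difference; for the three inputs in the question, both patterns give the same result.

```python
import re

# Hypothetical file name that contains ".xml" twice.
name = 'report_v1.xml_backup_0001.xml'

unanchored = re.search(r'[^_]*_[^_]*\.xml', name)
anchored = re.search(r'[^_]+_[^_]+\.xml$', name)

print(unanchored.group())  # report_v1.xml
print(anchored.group())    # backup_0001.xml
```

If only the end of the string matters, keeping the $ anchor is the safer choice.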

edited Mar 5, 2019 at 16:55

answered Mar 5, 2019 at 16:47
