Extract particular value using regular expression in Python

Question

I have a line of text, shown below, and I want to extract the amount from it:

Your bill of USD 17.99 is due on 09-01-2002

I wrote the regular expression below, treating the above line as a string:

import re

s = 'Your bill of USD 17.99 is due on 09-01-2002'

match = re.search(r'bill of.*([0-9]*\.[0-9]{2})', s.lower())
if match:
    print(match.group(1))

It prints,

.99

But I want it to print 17.99

I just don't seem to understand why it is not capturing the whole amount. I think it has something to do with the greedy aspect of regular expressions. Any suggestions would be a great help.

You've got plenty of good answers. Just remember how regexes match. First, the engine does everything possible to succeed: if any combination of zero-length, minimal, and maximal matches makes the overall match work, it will work. Second, if there's more than one way to make the match succeed, the match chosen is the longest version of the leftmost match: start as early as possible and match as much as possible. This applies to the pieces as well as the whole, which is why .* eats your 17.
– Mark Reed, Commented May 5, 2015 at 10:35
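To see which piece of the pattern takes the 17, one option is to wrap the .* in its own group and inspect both groups. This is a small illustrative sketch rather than code from the question; the extra capture group and the names eaten and amount are additions made here purely for inspection.

```python
import re

s = 'your bill of usd 17.99 is due on 09-01-2002'

# Same pattern as in the question, except that .* is wrapped in a group
# (added for illustration) so we can see what it consumed.
m = re.search(r'bill of(.*)([0-9]*\.[0-9]{2})', s)

eaten = m.group(1)   # text taken by the greedy .*
amount = m.group(2)  # text left over for the amount group

print(repr(eaten))   # ' usd 17'
print(repr(amount))  # '.99'
```

Because [0-9]* is allowed to match nothing, the engine is satisfied as soon as .* has backed off to just before the decimal point.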

jonrsharpe · Accepted Answer · 2015-05-05 10:35:19Z

5

Your problem is that * means zero or more, . matches digits too, and quantifiers are greedy (i.e. the earlier expression .* is 'stealing' the digits before the decimal point). See this demo: https://regex101.com/r/vN5vJ5/1

Instead, make it match all non-digits prior to the start of the number (and use \d rather than [0-9] for digits within the number):

>>> import re
>>> s = 'Your bill of USD 17.99 is due on 09-01-2002'
>>> re.findall(r'bill of\D*(\d*\.\d{2})', s)
['17.99']

Updated demo: https://regex101.com/r/vN5vJ5/4

If your format doesn't allow e.g. USD .99 (rather than USD 0.99), consider making the first digit capture "one or more" (+) rather than "zero or more" (*).
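As a rough sketch of that difference, the snippet below compares the two quantifiers on the original line and on a made-up line with no leading digit. The second input string and the variable names are assumptions for illustration only.

```python
import re

star = re.compile(r'bill of\D*(\d*\.\d{2})')  # zero or more leading digits
plus = re.compile(r'bill of\D*(\d+\.\d{2})')  # at least one leading digit

full = 'Your bill of USD 17.99 is due on 09-01-2002'
bare = 'Your bill of USD .99 is due on 09-01-2002'  # hypothetical input

star_full = star.search(full).group(1)  # '17.99'
plus_full = plus.search(full).group(1)  # '17.99'
star_bare = star.search(bare).group(1)  # '.99'
plus_bare = plus.search(bare)           # None: no digit before the point

print(star_full, plus_full, star_bare, plus_bare)
```

With +, an amount written without a leading digit is rejected outright, so check the result for None before calling group.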

edited May 5, 2015 at 10:35

answered May 5, 2015 at 10:27


3 Comments

Mohd Ali Over a year ago

Does using re.findall() have a performance cost? I mean, re.search will stop as soon as it finds a match, but re.findall() will keep looking even after it has found one. Is that how re.findall() works?

jonrsharpe Over a year ago

@malee I just did that for a quick demo to paste into my answer, rather than adding the step of extracting the captured group from the match; if you're only looking for one cost from the text use re.search.

Mohd Ali Over a year ago

Yes, I'm using re.search because I have to stop after the first match.
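The difference discussed in these comments can be sketched as follows, assuming a made-up line that contains two amounts. The sample text and variable names are hypothetical; re.search, re.findall and re.finditer are the standard library functions.

```python
import re

pattern = re.compile(r'bill of\D*(\d+\.\d{2})')

# Hypothetical text containing two amounts.
text = 'Your bill of USD 17.99 is due; your last bill of USD 5.49 was paid'

# search stops at the first match and returns a match object (or None).
first = pattern.search(text)
first_amount = first.group(1) if first else None

# findall scans the whole string and returns every captured group.
all_amounts = pattern.findall(text)

# finditer yields matches lazily, so taking one item also stops early.
lazy_first = next(pattern.finditer(text)).group(1)

print(first_amount)  # 17.99
print(all_amounts)   # ['17.99', '5.49']
print(lazy_first)    # 17.99
```

For a single short line the cost difference is unlikely to matter; the more practical distinction is that search returns a match object while findall returns a list.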

Pedro Lobito · Accepted Answer · 2015-05-05 10:26:33Z

0

The .* in your regex is greedy. Try the non-greedy .*? instead:

import re
s = 'Your bill of USD 17.99 is due on 09-01-2002'

match = re.search(r"bill.*?(\d+\.\d{2})", s.lower())
if match:
    print(match.group(1))

Demo

http://ideone.com/66mF8w

answered May 5, 2015 at 10:26


duknust · Accepted Answer · 2015-05-05 10:29:12Z

0

You just need to use

match = re.search( r"[a-zA-Z\ ]+([0-9\.]+)\ .*", s.lower() )
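One caveat worth checking before relying on this pattern: it is not anchored to "bill of", so it captures the first run of digits and dots that follows letters and spaces. The sketch below uses a made-up second input to show this; that string and the variable names are assumptions for illustration.

```python
import re

pattern = r"[a-zA-Z\ ]+([0-9\.]+)\ .*"

original = 'your bill of usd 17.99 is due on 09-01-2002'
other = 'invoice 42 for your bill of usd 17.99 is due'  # hypothetical input

original_amount = re.search(pattern, original).group(1)
other_amount = re.search(pattern, other).group(1)

print(original_amount)  # 17.99
print(other_amount)     # 42
```

If other numbers can appear earlier in the line, keeping the literal "bill of" prefix in the pattern avoids picking them up.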

answered May 5, 2015 at 10:29


karthik manchala · Accepted Answer · 2015-05-05 10:30:30Z

0

Make your .* non-greedy (because greedy people tend to eat as much as possible :P) by adding ? after it, i.e. .*?. You can use the following:

'bill of.*?([0-9]*\.[0-9]{2})'
          ^
    (see the change)

i.e:

match = re.search( r'bill of.*?([0-9]*\.[0-9]{2})', s.lower() )

edited May 5, 2015 at 10:30

answered May 5, 2015 at 10:24


Owbea · Accepted Answer · 2015-05-05 10:32:42Z

0

Try using:

'bill of [\D]*([0-9]*\.[0-9]{2})'

The .* after 'of' also catches the '17'.

edited May 5, 2015 at 10:32

karthik manchala


answered May 5, 2015 at 10:29

Owbea


mart1n · Accepted Answer · 2015-05-05 12:51:14Z

0

Because the * matches [0-9] zero or more times, the preceding .* eats the 17. You can use this:

match = re.search( r'bill of.*?([0-9]*\.[0-9]{2})', s.lower() )

The question mark in .*? makes it non-greedy. You could also use + instead of * after the first character class to require at least one digit before the decimal point.

edited May 5, 2015 at 12:51

answered May 5, 2015 at 10:25


2 Comments

jonrsharpe Over a year ago

Just adding + would get '7.99', as .* would still "eat" the '1'.

Taemyr Over a year ago

Using + instead of * would not by itself remove the problem. With the pattern in the OP, it would change the match from .99 to 7.99 rather than to 17.99.
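The point made in these two comments can be checked with a quick side-by-side run of the four combinations (greedy or lazy prefix, * or + on the leading digits). This is an illustrative sketch; the dictionary labels are made up here.

```python
import re

s = 'your bill of usd 17.99 is due on 09-01-2002'

patterns = {
    'greedy, *': r'bill of.*([0-9]*\.[0-9]{2})',
    'greedy, +': r'bill of.*([0-9]+\.[0-9]{2})',
    'lazy, *': r'bill of.*?([0-9]*\.[0-9]{2})',
    'lazy, +': r'bill of.*?([0-9]+\.[0-9]{2})',
}

# Run every variant against the same string and collect group 1.
results = {name: re.search(p, s).group(1) for name, p in patterns.items()}

for name, value in results.items():
    print(name, '->', value)
# greedy, * -> .99
# greedy, + -> 7.99
# lazy, * -> 17.99
# lazy, + -> 17.99
```

Switching to + only forces the greedy .* to give back one digit; it is the lazy .*? (or a \D* prefix) that lets the group capture the whole amount.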
