2

I have a line of text, shown below, and I want to extract the amount from it:

Your bill of USD 17.99 is due on 09-01-2002

Treating the above line as a string, I have written the regular expression below:

import re

s = 'Your bill of USD 17.99 is due on 09-01-2002'

match = re.search(r'bill of.*([0-9]*\.[0-9]{2})', s.lower())
if match:
    print(match.group(1))

It prints:

.99

But I want it to print 17.99

I don't understand why it is not capturing the whole amount. I think it has something to do with the greedy behavior of regular expressions. Any suggestion would be a great help.

1
  • You've got plenty of good answers. Just remember how regexes match: first, do everything possible to succeed. If there's any weird combination of zero-length matches and minimal matches and maximal matches at all to make the match work, it will work. Second, if there's more than one way to make the match succeed, then the specific match chosen is the longest version of the leftmost match: start as early as possible and match as much as possible. This applies to the pieces as well as the whole, which is why .* eats your 17. Commented May 5, 2015 at 10:35

6 Answers

5

Your problem is that * means zero or more, . includes digits, and the capturing is greedy (i.e. the earlier expression .* is 'stealing' all of the numbers). See this demo: https://regex101.com/r/vN5vJ5/1

Instead, make it match all non-digits prior to the start of the number (and use \d rather than [0-9] for digits within the number):

>>> import re
>>> s = 'Your bill of USD 17.99 is due on 09-01-2002'
>>> re.findall(r'bill of\D*(\d*\.\d{2})', s)
['17.99']

Updated demo: https://regex101.com/r/vN5vJ5/4

If your format doesn't allow e.g. USD .99 (rather than USD 0.99), consider making the first digit capture "one or more" (+) rather than "zero or more" (*).
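As a rough sketch of that last suggestion, the pattern can be combined with re.search and a required leading digit. The helper name extract_amount is made up for illustration and is not part of the original answer.

```python
import re

# Hypothetical helper for illustration; the name is not from the original answer.
def extract_amount(text):
    # \D* skips the non-digit run after "bill of" (here " USD "), so the
    # capture group starts at the first digit. \d+ requires at least one
    # digit before the decimal point.
    match = re.search(r'bill of\D*(\d+\.\d{2})', text)
    return match.group(1) if match else None

print(extract_amount('Your bill of USD 17.99 is due on 09-01-2002'))  # 17.99
print(extract_amount('Your bill of USD .99 is due on 09-01-2002'))    # None
```

With \d+ an amount written as .99 is rejected outright, which is only appropriate if the input format always includes the leading digit.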

3 Comments

Does using re.findall() have a performance issue? I mean, re.search will stop as soon as it finds a match, but re.findall() will keep looking even after it has found one. Is that how re.findall() works?
@malee I just did that for a quick demo to paste into my answer, rather than adding the step of extracting the captured group from the match; if you're only looking for one cost in the text, use re.search.
Yes, I'm using re.search because I have to stop after the first match.
0

The .* in your regex is greedy. Try this instead:

import re
s = 'Your bill of USD 17.99 is due on 09-01-2002'

match = re.search(r"bill.*?([\d]+\.[\d]{2})", s.lower())
if match:
    print(match.group(1))

Demo

http://ideone.com/66mF8w

0

You just need to use

match = re.search( r"[a-zA-Z\ ]+([0-9\.]+)\ .*", s.lower() )

0

Make your .* non-greedy (because greedy people tend to eat as much as possible :P) by adding a ? after it, so that it becomes .*?. You can use the following:

'bill of.*?([0-9]*\.[0-9]{2})'
          ^
    (see the change)

That is:

match = re.search( r'bill of.*?([0-9]*\.[0-9]{2})', s.lower() )
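To see what the wildcard actually consumes in each case, one option is to wrap it in an extra capture group. This is only an illustrative sketch: the extra group and the variable names greedy and lazy are additions for the demonstration and are not needed in the real pattern.

```python
import re

s = 'your bill of usd 17.99 is due on 09-01-2002'

# Same patterns as in the question and in this answer, with an extra group
# around the wildcard so its share of the match becomes visible.
greedy = re.search(r'bill of(.*)([0-9]*\.[0-9]{2})', s)
lazy = re.search(r'bill of(.*?)([0-9]*\.[0-9]{2})', s)

print(repr(greedy.group(1)), greedy.group(2))  # ' usd 17' .99
print(repr(lazy.group(1)), lazy.group(2))      # ' usd ' 17.99
```

The greedy wildcard keeps the 17 because [0-9]* is allowed to match nothing, while the lazy one stops as soon as the rest of the pattern can match.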

0

Try to use:

'bill of [\D]*([0-9]*\.[0-9]{2})'

The .* after 'of' also catches the '17'.

0

Because the * matches [0-9] zero or more times, the preceding .* eats the 17. You can use this:

match = re.search( r'bill of.*?([0-9]*\.[0-9]{2})', s.lower() )

The question mark in .*? makes it non-greedy. You could also replace the * after the character class with + to require at least one digit.

2 Comments

Just adding + would get '7.99', as .* would still "eat" the '1'.
Using + instead of * would not remove the problem. In the OP's pattern you would then change the match from .99 to 7.99 rather than to 17.99.
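The point made in these comments can be checked with a small comparison of the three combinations. This is a sketch only; the dictionary labels and variable names are chosen here for illustration.

```python
import re

s = 'your bill of usd 17.99 is due on 09-01-2002'

# Labels are illustrative; the patterns differ only in the two quantifiers.
patterns = {
    'greedy .* with *': r'bill of.*([0-9]*\.[0-9]{2})',
    'greedy .* with +': r'bill of.*([0-9]+\.[0-9]{2})',
    'lazy .*? with +': r'bill of.*?([0-9]+\.[0-9]{2})',
}

results = {label: re.search(p, s).group(1) for label, p in patterns.items()}
for label, amount in results.items():
    print(label, '->', amount)
# greedy .* with * -> .99
# greedy .* with + -> 7.99
# lazy .*? with + -> 17.99
```

Switching to + alone only forces the greedy .* to give back one digit; it is the lazy .*? that lets the group start at the first digit.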
