I have tried to extract the fields from this string:

$str = "Instant Oatmeal - Corn Flavour 175g (35g x 5)";
preg_match('/(?P<name>.*) (?P<total_weight>\d+)(?P<total_weight_unit>.*) \((?P<unitWeight>\d+)(?P<unitWeight_unit>.*) x (?P<portion_no>\d+)\)/', $str, $m);

The result is correct:

Instant Oatmeal - Corn Flavour 175g (35g x 5)
name : Instant Oatmeal - Corn Flavour
total_weight : 175 g
#portion : 5
unit weight : 35 g

However, if I try to extract the fields from

$str = "Cholcolate Sandwich Cookies (Tray) 264.6g (29.4g x 9)";

the result is incorrect:

Cholcolate Sandwich Cookies (Tray) 264.6g (29.4g x 9)
name : Cholcolate Sandwich Cookies (Tray)
total_weight : 264 .6g
#portion : 9
unit weight : 29 .4g

How can I solve this?

  • The total weight should be "total_weight : 264.6g", but it comes out as "total_weight : 264 .6g". The unit should be "g", but it is now ".6g". Commented Oct 9, 2011 at 14:32

2 Answers

Use free-spacing mode for non-trivial regexes!

When dealing with non-trivial regexes like this one, you can dramatically improve readability (and maintainability) by writing them in free-spacing format with lots of comments (and indentation for any nested parentheses). Here is your original regex in free spacing format with comments:

$re_orig = '/# Original regex with added comments.
    (?P<name>.*)               # $name:
    [ ]                        # Space separates name from weight.
    (?P<total_weight>\d+)      # $total_weight:
    (?P<total_weight_unit>.*)  # $total_weight_unit:
    [ ]                        # Space separates total units from portions data.
    \(                         # Literal parens enclosing portions data.
    (?P<unitWeight>\d+)        # $unitWeight:
    (?P<unitWeight_unit>.*)    # $unitWeight_unit:
    [ ]x[ ]                    # "space-X-space" separates portions data.
    (?P<portion_no>\d+)        # $portion_no:
    \)                         # Literal parens enclosing portions data.
    /x';

Here is an improved version:

$re_improved = '/# Match Name, total weight, units and portions data.
    ^                       # Anchor to start of string.
    (?P<name>.*?)           # $name:
    [ ]+                    # Space(s) separate name from weight.
    (?P<total_weight>       # $total_weight:
      \d+                   # Required integer portion.
      (?:\.\d*)?            # Optional fractional portion.
    )
    (?P<total_weight_unit>  # $total_weight_unit:
      .+?                   # Units consist of any chars.
    )
    [ ]+                    # Space(s) separate total from portions.
    \(                      # Literal parens enclosing portions data.
    (?P<unitWeight>         # $unitWeight:
      \d+                   # Required integer portion.
      (?:\.\d*)?            # Optional fractional portion.
    )
    (?P<unitWeight_unit>    # $unitWeight_unit:
      .+?                   # Units consist of any chars.
    )
    [ ]+x[ ]+               # "space-X-space" separates portions data.
    (?P<portion_no>         # $portion_no:
      \d+                   # Required integer portion.
      (?:\.\d*)?            # Optional fractional portion.
    )
    \)                      # Literal parens enclosing portions data.
    $                       # Anchor to end of string.
    /xi';

Notes:

  • The expressions for all the numerical quantities have been improved to allow an optional fractional portion.
  • Added start-of-string and end-of-string anchors.
  • Added the i (ignore case) modifier in case the X in the portions data is uppercase.

I'm not sure how you are applying this regex, but this improved regex should solve your immediate problem.
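Since it is not clear how the regex is being applied, here is a minimal sketch of one way to use it with preg_match. The pattern is a condensed copy of the improved regex, and the helper name parse_product is hypothetical, chosen only for this illustration.

```php
<?php
// Condensed copy of the improved free-spacing regex (comments trimmed).
$re = '/^
    (?P<name>.*?)                      [ ]+
    (?P<total_weight>\d+(?:\.\d*)?)    # integer plus optional fraction
    (?P<total_weight_unit>.+?)         [ ]+
    \(
    (?P<unitWeight>\d+(?:\.\d*)?)
    (?P<unitWeight_unit>.+?)           [ ]+x[ ]+
    (?P<portion_no>\d+(?:\.\d*)?)
    \)
    $/xi';

// Hypothetical helper: returns the named groups, or null when nothing matches.
function parse_product(string $str, string $re): ?array
{
    if (preg_match($re, $str, $m) !== 1) {
        return null;
    }
    // preg_match stores each named group twice (by name and by index);
    // keep only the entries with string keys.
    return array_filter($m, 'is_string', ARRAY_FILTER_USE_KEY);
}

$samples = [
    'Instant Oatmeal - Corn Flavour 175g (35g x 5)',
    'Cholcolate Sandwich Cookies (Tray) 264.6g (29.4g x 9)',
];
foreach ($samples as $s) {
    print_r(parse_product($s, $re));
}
```

Checking the return value of preg_match before reading $m avoids notices on product strings that do not follow the "name weight (unit x count)" layout.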

Edit: 2011-10-09 11:17 MDT Changed expression for units to be more lax to allow for cases pointed out by Ilmari Karonen.


5 Comments

\w+ might not be enough for generic unit parsing, if there are any micrograms (μg), Ohms (Ω), Angstroms (Å), degrees (°) or feet and inches (' / ") involved. But I suppose none of those are very likely to appear in cooking recipes. However, fluid ounces (fl. oz) might be a problem.
@Ilmari Karonen - very good point. Have updated answer. Thanks!
@ridgerunner - thanks very much. Can you explain "(?:\.\d*)?" and "/xi" to me? I don't know what they mean.
(?:\.\d*)? means: "Optionally match one literal dot followed by zero or more digits." I recommend spending some time learning the basic syntax. A good place to start is with the tutorials at: regular-expressions.info. But if you really want to "know" regex (in the Neo: "I know kung-fu!" sense), I highly recommend: Mastering Regular Expressions (3rd Edition). This was the most useful book I have ever read.
Regarding: /xi, the / is the closing regex delimiter and the x and i are modifiers that tell the regex-engine to 1.) use free-spacing mode (x), and 2.) ignore case (i). See: PHP PCRE Pattern Modifiers.
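The two explanations above can be tried out in isolation. The following is a small sketch using made-up test strings, not code from the answer:

```php
<?php
// \d+ alone stops at the dot; the optional group lets the number continue.
preg_match('/\d+/', '264.6g', $a);            // $a[0] is "264"
preg_match('/\d+(?:\.\d*)?/', '264.6g', $b);  // $b[0] is "264.6"
preg_match('/\d+(?:\.\d*)?/', '175g', $c);    // $c[0] is "175"; the group matched nothing

// x: whitespace and #-comments inside the pattern are ignored,
//    so a literal space has to be written as [ ].
// i: letters match regardless of case, so "X" is accepted for "x".
$re = '/ \d+ [ ] x [ ] \d+   # e.g. "35 x 5"
       /xi';
var_dump(preg_match($re, '35 X 5'));  // int(1)
```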
Use this:

/(?P<name>.*) (?P<total_weight>\b[0-9]*\.?[0-9]+)(?P<total_weight_unit>.*) \((?P<unitWeight>\b[0-9]*\.?[0-9]+)(?P<unitWeight_unit>.*) x (?P<portion_no>\d+)\)/

Your problem is that the pattern does not account for decimal numbers: \d+ stops at the dot, so the rest of the number ends up in the unit group. I corrected this. Note that the portion count is still an integer, but I guess this is logical :)
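As a quick check, the pattern from this answer can be run against the string that failed in the question. This is only a sketch of how one might verify it:

```php
<?php
// Pattern copied from this answer; [0-9]*\.?[0-9]+ accepts "264", "264.6" and ".6".
$re = '/(?P<name>.*) (?P<total_weight>\b[0-9]*\.?[0-9]+)(?P<total_weight_unit>.*) '
    . '\((?P<unitWeight>\b[0-9]*\.?[0-9]+)(?P<unitWeight_unit>.*) x (?P<portion_no>\d+)\)/';

$str = 'Cholcolate Sandwich Cookies (Tray) 264.6g (29.4g x 9)';

if (preg_match($re, $str, $m) === 1) {
    echo 'name : ', $m['name'], "\n";
    echo 'total_weight : ', $m['total_weight'], ' ', $m['total_weight_unit'], "\n";
    echo '#portion : ', $m['portion_no'], "\n";
    echo 'unit weight : ', $m['unitWeight'], ' ', $m['unitWeight_unit'], "\n";
}
```

Unlike the first answer, this pattern is not anchored and is case-sensitive, so a string using an uppercase "X" as the separator would not match.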
