I am trying to parse and tokenize recipes. Ingredients can be written in two main ways:

Style 1

1 Ripe Avocado

1x Ripe Avocado - x is optional and sometimes present

OR:

Style 2

1 Ripe Avocado (lrg) 123

1x Ripe Avocado (lrg) 123 - if the abbreviation is present, so is an integer item code

I am trying to a) detect if it is a match for Style 1 or 2 and b) tokenize into the following capture-groups.

[1][Ripe Avocado][lrg]?[123]?

I can't seem to consistently parse this, so any help would be much appreciated!

Edit:

^(\d+)x? ([a-zA-Z0-9_', -]+) is what I had but it didn't account for the optional capture groups in Style 2.

  • Are you really using [] in your regex? That's for defining character classes, not groups. Commented Jun 24, 2019 at 11:40
  • Also, (?:) would be a non-capturing group, as opposed to an optional group. Commented Jun 24, 2019 at 11:41
  • Can you please share the regex you are trying? From what I understand you probably need something like: (\d)x?\s(\w.*)(\s(lrg)\ (\d.*))? Commented Jun 24, 2019 at 11:43
  • @VLAZ - sorry, that is just pseudo-code to show my ideal outcome. Commented Jun 24, 2019 at 11:44

2 Answers

You could use a pattern with an optional second part for the abbreviation and the item code integer. Capturing each value in its own group gives you 2 required groups and 2 optional groups.

If you want to match whitespace characters instead of a space only, you could use \s instead.

Assuming the ingredient name consists of words that can be matched with word characters \w, you might use:

\b(\d+)x? (\w+(?: \w+)*)(?: \(([^()]+)\) (\d+))?\b

Explanation

(with a space denoted as [ ] for clarity)

  • \b Word boundary
  • (\d+)x? Capture group 1, match 1+ digits then match optional x
  • [ ](\w+(?: \w+)*) Match a space, then capture in group 2 matching 1+ word chars and repeat 0+ times a space and 1+ word chars
  • (?: Non capturing group
    • [ ]\( Match space and (
    • ([^()]+) capturing group 3, match not () using a negated character class
    • \) Match )
    • [ ](\d+) Match a space, then capture 1+ digits in group 4
  • )? Close non capturing group and make it optional so group 3 and 4 are optional
  • \b Word boundary
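
As a rough illustration of how the four groups come back, here is a minimal Python sketch using the pattern above. The names INGREDIENT and tokenize are made up for this example, and using fullmatch assumes each line holds exactly one ingredient; neither is part of the original question.

```python
import re

# Pattern from the answer above. INGREDIENT and tokenize are
# illustrative names, not something from the question.
INGREDIENT = re.compile(r"\b(\d+)x? (\w+(?: \w+)*)(?: \(([^()]+)\) (\d+))?\b")

def tokenize(line):
    """Return (quantity, name, abbreviation, item_code), or None if no match.

    Assumption: one ingredient per line, so the whole line must match.
    """
    m = INGREDIENT.fullmatch(line.strip())
    return m.groups() if m else None

for line in ["1 Ripe Avocado", "1x Ripe Avocado", "1x Ripe Avocado (lrg) 123"]:
    print(tokenize(line))
# ('1', 'Ripe Avocado', None, None)
# ('1', 'Ripe Avocado', None, None)
# ('1', 'Ripe Avocado', 'lrg', '123')
```

Groups 3 and 4 come back as None when the optional part is absent, which is what lets you tell the two styles apart.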

Seems to me that Style 1 and Style 2 are very similar. I would use this regex to extract all the necessary groups:

/(\d+).? ([\w ]*) ?(?>\((.*)\) (.*))?/

Then, you can determine if it's Style 1 or Style 2 based on the presence of matching groups 3 and 4.
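
A minimal Python sketch of that check might look like the following. Two things here are assumptions rather than part of the regex above: the atomic group (?> ... ) is swapped for a plain non-capturing group (?: ... ), since older Python versions do not support atomic groups, and the name is stripped because [\w ]* can pick up a trailing space. LOOSE and detect_style are hypothetical names.

```python
import re

# Same shape as the regex above, but with (?: ... ) in place of (?> ... ).
# That swap is an assumption made for portability across Python versions.
LOOSE = re.compile(r"(\d+).? ([\w ]*) ?(?:\((.*)\) (.*))?")

def detect_style(line):
    """Return (style, (quantity, name, abbreviation, item_code)) or None."""
    m = LOOSE.search(line)
    if m is None:
        return None
    qty, name, abbr, code = m.groups()
    # Style 2 is signalled by the optional groups 3 and 4 being present.
    style = 2 if abbr is not None and code is not None else 1
    # [\w ]* may capture a trailing space before "(", so strip the name.
    return style, (qty, name.strip(), abbr, code)

print(detect_style("1 Ripe Avocado"))
print(detect_style("1x Ripe Avocado (lrg) 123"))
# (1, ('1', 'Ripe Avocado', None, None))
# (2, ('1', 'Ripe Avocado', 'lrg', '123'))
```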

FYI, you can use the very useful regex101 to validate regexps: https://regex101.com/r/0LYxdc/1

Cheers

Lucas
