Parse string with Regex - optional capture groups

Question

I am trying to parse and tokenize recipes. Ingredients can be written in two main ways:

Style 1

1 Ripe Avocado

1x Ripe Avocado - x is optional and sometimes present

OR:

Style 2

1 Ripe Avocado (lrg) 123

1x Ripe Avocado (lrg) 123 - if the abbreviation is present, so is an integer item code

I am trying to a) detect if it is a match for Style 1 or 2 and b) tokenize into the following capture-groups.

[1][Ripe Avocado][lrg]?[123]?

I can't seem to consistently parse this, so any help would be much appreciated!

Edit:

^(\d+)x? ([a-zA-Z0-9_', -]+) is what I had but it didn't account for the optional capture groups in Style 2.

Are you really using [] in your regex? That's for defining character classes, not groups.
– VLAZ, Commented Jun 24, 2019 at 11:40
Also (?:) would be a non-capturing group, as opposed to an optional group.
– VLAZ, Commented Jun 24, 2019 at 11:41
Can you please share the regex you are trying? From what I understand you probably need something like: (\d)x?\s(\w.*)(\s(lrg)\ (\d.*))?
– abelgana, Commented Jun 24, 2019 at 11:43
@VLAZ - sorry, that is just pseudo-code to show my ideal outcome.
– bsb_coffee, Commented Jun 24, 2019 at 11:44

The fourth bird · Accepted Answer · 2019-06-24 12:01:13Z

You could use a pattern with an optional second part for the abbreviation and the integer item code. Capturing each value in its own group gives you 2 required groups and 2 optional groups.

If you want to match whitespace characters instead of a space only, you could use \s instead.

Assuming these are words and can be matched by using word characters \w, you might use:

\b(\d+)x? (\w+(?: \w+)*)(?: \(([^()]+)\) (\d+))?\b

Explanation

(with a space denoted as [ ] for clarity)

\b Word boundary
(\d+)x? Capture group 1, match 1+ digits then match optional x
[ ](\w+(?: \w+)*) Match a space, then capture in group 2 matching 1+ word chars and repeat 0+ times a space and 1+ word chars
(?: Non capturing group
- [ ]\( Match space and (
- ([^()]+) capturing group 3, match not () using a negated character class
- \) Match )
- [ ](\d+) Match a space and capture in group 4 matching 1+ digits
)? Close non capturing group and make it optional so group 3 and 4 are optional
\b Word boundary
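
As a rough illustration of how the four groups come out, here is a minimal Python sketch. The question does not say which language is in use, so Python is an assumption here, and `parse_ingredient` is a hypothetical helper name, not something from the question.

```python
import re

# The pattern from this answer, compiled once for reuse.
INGREDIENT = re.compile(r"\b(\d+)x? (\w+(?: \w+)*)(?: \(([^()]+)\) (\d+))?\b")

def parse_ingredient(line):
    """Return (quantity, name, abbreviation, item_code), or None if no match.

    abbreviation and item_code are None for Style 1 lines.
    """
    m = INGREDIENT.search(line)
    if m is None:
        return None
    qty, name, abbr, code = m.groups()
    # Groups 3 and 4 are None when the optional part did not take part in the match.
    item_code = int(code) if code is not None else None
    return int(qty), name, abbr, item_code

for line in ["1 Ripe Avocado", "1x Ripe Avocado", "1x Ripe Avocado (lrg) 123"]:
    print(parse_ingredient(line))
```

Because \w does not match apostrophes, commas or hyphens, names containing those characters would need a wider character class than the one assumed here.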


Lucas · Accepted Answer · 2019-06-24 12:08:49Z


Seems to me that Style 1 and Style 2 are very similar. I would use this regex to extract all the necessary groups:

/(\d+).? ([\w ]*) ?(?>\((.*)\) (.*))?/

Then, you can determine if it's Style 1 or Style 2 based on the presence of matching groups 3 and 4.
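
A minimal sketch of that check, assuming Python (the question does not name a language). The atomic group `(?>...)` is swapped for a plain non-capturing group `(?:...)` so the pattern also runs on Python versions before 3.11; since nothing follows the group, this should not change what is matched. `classify` is a hypothetical helper name.

```python
import re

# Pattern from this answer, with (?:...) swapped in for (?>...) (see note above).
PATTERN = re.compile(r"(\d+).? ([\w ]*) ?(?:\((.*)\) (.*))?")

def classify(line):
    """Return (style, quantity, name, abbreviation, item_code), or None if no match."""
    m = PATTERN.match(line)
    if m is None:
        return None
    qty, name, abbr, code = m.groups()
    # Style 2 is recognised by groups 3 and 4 both having matched.
    style = 2 if abbr is not None and code is not None else 1
    # [\w ]* can swallow the space before "(", so trim the name.
    return style, qty, name.strip(), abbr, code

print(classify("1 Ripe Avocado"))
print(classify("1x Ripe Avocado (lrg) 123"))
```

The `strip()` call is there because the character class `[\w ]` includes the space, so group 2 keeps a trailing space on Style 2 lines.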

FYI, you can use the very useful regex101 to validate regexps: https://regex101.com/r/0LYxdc/1

Cheers

Lucas

edited Jun 24, 2019 at 12:08

answered Jun 24, 2019 at 11:54

