What regex will match repeat patterns in the string below?

Question

I have a small problem with a regex. I need to parse all the words from the string below:

word - word2, word3, word4

I have tried to solve it, but my pattern returns only the last iteration of the repeated group:

(\w+) - ((\w+)[, ]{0,2})+

https://regex101.com/r/2Uot2M/1
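
For illustration, the behavior described above can be reproduced with a minimal sketch. Python's built-in re module is an assumption here, since the question does not name a language; like most engines, it keeps only the last value captured by a repeated group.

```python
import re

text = "word - word2, word3, word4"
pattern = re.compile(r"(\w+) - ((\w+)[, ]{0,2})+")

match = pattern.fullmatch(text)
# Each repetition of the outer (...)+ overwrites groups 2 and 3,
# so only the values from the final iteration survive.
groups = match.groups()
print(groups)
```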

Thank you for any help.

P.S. I can't just match all words with something like (\w+). I need to match only strings in the format above.

On Stack Overflow, you are expected to try to write the code yourself. After doing more research, if you have a problem, you can post what you've tried with a clear explanation of what isn't working and provide a Minimal, Complete, and Verifiable example within the question itself.
– Rob, Commented Jan 23, 2018 at 23:37
Well, if you use your regex in .NET, you already have what you need: the values are in the Group 3 capture collection (see the Table tab and expand the $3 captures). Otherwise, a PCRE solution will look uglier: (?:\G(?!^)|^(?=\w+ - (?:\w+[, ]{0,2})+$))\W*\K\w+
– Wiktor Stribiżew, Commented Jan 23, 2018 at 23:49
@WiktorStribiżew Thanks for the answer. Your PCRE is working fine :) I will try to simplify it. Can you post it as an answer? I'll mark it.
– Emilien Vidal, Commented Jan 24, 2018 at 0:05

Wiktor Stribiżew · Accepted Answer · 2018-01-24 00:11:36Z

1

If you are using the PCRE regex library and need to validate the string before extracting words from it, you may use the following pattern:

(?:\G(?!^)|^(?=\w+ - (?:\w+[, ]{0,2})+$))\W*\K\w+

See the regex demo.

How it works

(?:\G(?!^)|^(?=\w+ - (?:\w+[, ]{0,2})+$)) - either the end of the previous match (\G(?!^)), or (|) the start of the string (^) if it is followed by this pattern:
- \w+ - 1+ word chars
- " - " - a hyphen enclosed with single spaces
- (?:\w+[, ]{0,2})+ - 1+ occurrences of:
  - \w+ - 1+ word chars
  - [, ]{0,2} - 0 to 2 occurrences of a space or comma
- $ - end of string
\W* - 0+ non-word chars
\K - a match reset operator that discards all text matched so far from Group 0 (whole match) buffer
\w+ - 1+ word chars.
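
Where \G and \K are unavailable (Python's built-in re module supports neither), the same validate-then-extract idea can be sketched in two steps. This is a minimal sketch under that assumption, not the pattern above; the names FORMAT and extract_words are hypothetical.

```python
import re

# Same format check as the lookahead in the pattern above.
FORMAT = re.compile(r"\w+ - (?:\w+[, ]{0,2})+")

def extract_words(text):
    """Return all words if text has the 'word - word2, word3' shape, else []."""
    if FORMAT.fullmatch(text) is None:
        return []  # format check failed, so extract nothing
    return re.findall(r"\w+", text)

print(extract_words("word - word2, word3, word4"))
```

The trade-off is two passes over the string instead of one, in exchange for a pattern that is easier to read.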

rmdelgad · Answer · 2018-01-23 23:49:03Z

1

If you're just looking to capture each word separately, you can just use this regex: (\w+)

This matches every substring of one or more word characters (letters, digits, or underscores) and ignores the whitespace and punctuation. On Regex 101 it finds 'word', 'word2', 'word3', and 'word4' as four separate matches, each captured in group 1.
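
As a minimal sketch of this approach (using Python's built-in re module, which is an assumption since the thread does not name a language), re.findall returns one entry per match of the group. Note that it performs no format check at all:

```python
import re

text = "word - word2, word3, word4"
words = re.findall(r"(\w+)", text)

# findall does not check the "word - word2, ..." format: it also
# returns words from strings that do not follow it.
other = re.findall(r"(\w+)", "no dash here")
print(words, other)
```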

Helpful link on word characters: https://www.w3schools.com/jsref/jsref_regexp_wordchar.asp

Helpful link on quantifiers ('+' is a quantifier): https://learn.microsoft.com/en-us/dotnet/standard/base-types/quantifiers-in-regular-expressions

edited Jan 23, 2018 at 23:49

answered Jan 23, 2018 at 23:37

9 Comments

Emilien Vidal Over a year ago

Yep. Thanks for the answer. But I need to parse strictly according to the format. I can't just match all words :)

rmdelgad Over a year ago

@Rob, added some more explanation & helpful links.

Rob Over a year ago

Links for answers are not allowed on SO and will get your answer closed. Form your own answer here or delete this altogether. stackoverflow.com/help/how-to-answer

Emilien Vidal Over a year ago

@Rob Hi, just want to get advice, not a solution. Calm down, man :)

Rob Over a year ago

@kpa6 Nice. This is a great place but violating Stack Overflow policy and rules is never a good thing.


score 1 · Answer · 2018-01-24 02:11:07Z

No validation is needed, except for the first word-word pair.
Using the \G anchor and a branch reset fills an array
in which the words collect in capture group 1.

(?|(\w+)[ ]*-[ ]*(?=\w)|(?!^)\G[ ]*,?[ ]*(\w+))

https://regex101.com/r/deZq5u/1

Note that there is no need for BOS or EOS anchors, which are crutches.
This will find valid matches mid-string, as it should.

Formatted and tested

(Note: the part commented "# Optional spaces, single comma, spaces" will always match
a space, a comma, or both. Even though each piece is optional, it acts as a required
separator, because the greedy \w+ clause before it leaves no word characters behind.)

 (?|                         # Branch reset
      ( \w+ )                # (1), First word
      [ ]* - [ ]*            # qualified with a dash,
      (?= \w )               # then a lookahead for next word
   |                         # or,
      (?! ^ )                # Do not let \G match at BOS
      \G                     # Anchor, second or more match
      [ ]* ,? [ ]*           # Optional spaces, single comma, spaces
      ( \w+ )                # (1), Second or more word
 )                           # End branch reset
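
Python's built-in re module has neither a \G anchor nor branch reset, so the following is only a rough emulation of the idea, not the regex above. It assumes that Pattern.match(text, pos) can play the role of \G by anchoring each attempt at the end of the previous match. The names HEAD, TAIL, and collect_words are hypothetical.

```python
import re

HEAD = re.compile(r"(\w+)[ ]*-[ ]*(?=\w)")  # first branch: word, dash, lookahead
TAIL = re.compile(r"[ ]*,?[ ]*(\w+)")       # second branch: separator, next word

def collect_words(text):
    """Collect words the way the two branches would, one match at a time."""
    words = []
    pos = 0
    chained = False  # True only when pos is the end of the previous match
    while pos <= len(text):
        m = HEAD.match(text, pos)
        if m is None and chained:
            # Emulates \G: TAIL may match only right after a previous match.
            m = TAIL.match(text, pos)
        if m is None:
            chained = False  # the chain is broken; keep scanning for a new HEAD
            pos += 1
            continue
        words.append(m.group(1))
        pos = m.end()
        chained = True
    return words

print(collect_words("word - word2, word3, word4"))
```

As in the answer, no start or end anchors are used, so a valid run is also found mid-string.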
