I have a small problem with a regex. I need to extract all the words from a string in the following format:

word - word2, word3, word4

I have tried to solve it, but my pattern returns only the last iteration of the repeated group:

(\w+) - ((\w+)[, ]{0,2})+

https://regex101.com/r/2Uot2M/1

Thank you for any help.

P.S.: I can't just match all words with (\w+). I need to match only strings in the format above.
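For illustration, the "only the last iteration" behaviour can be reproduced outside regex101. This is a minimal sketch using Python's standard `re` module (an assumption, since the question does not name a language). It shows that a capture group inside a repeated group keeps only the text from its final repetition.

```python
import re

text = "word - word2, word3, word4"

# The pattern from the question: group 3 sits inside the repeated group 2.
m = re.match(r"(\w+) - ((\w+)[, ]{0,2})+", text)

# Each repetition overwrites groups 2 and 3, so only the last word survives.
groups = m.groups()
print(groups)  # ('word', 'word4', 'word4')
```

The intermediate words word2 and word3 are matched but not kept, which is why the whole pattern cannot return every word through a single numbered group in most engines.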

  • On Stack Overflow, you are expected to try to write the code yourself. After doing more research if you have a problem you can post what you've tried with a clear explanation of what isn't working and providing a Minimal, Complete, and Verifiable example within the question itself. Commented Jan 23, 2018 at 23:37
  • @Simon I'm not going to do the work for him. Commented Jan 23, 2018 at 23:43
  • Well, if you use your regex in .NET, you already have what you need, the values are in Group 3 capture collection, see Table tab and expand $3 captures. Or, a PCRE solution will look uglier, (?:\G(?!^)|^(?=\w+ - (?:\w+[, ]{0,2})+$))\W*\K\w+. Commented Jan 23, 2018 at 23:49
  • @WiktorStribiżew Thanks for the answer. Your PCRE pattern is working fine :) I will try to simplify it. Can you post it as an answer? I'll mark it. Commented Jan 24, 2018 at 0:05
  • Well, it took time to explain, see the answer below. Commented Jan 24, 2018 at 0:12

3 Answers

1

If you are using the PCRE regex library and you need to pre-validate a string before extracting words from it, you may use the following pattern:

(?:\G(?!^)|^(?=\w+ - (?:\w+[, ]{0,2})+$))\W*\K\w+

See the regex demo.

How it works

  • (?:\G(?!^)|^(?=\w+ - (?:\w+[, ]{0,2})+$)) - either the end of the previous match (\G(?!^)) or (|) the start of the string (^), provided it is followed by this pattern:
    • \w+ - 1+ word chars
    • " - " - a hyphen enclosed in single spaces
    • (?:\w+[, ]{0,2})+ - 1+ occurrences of:
      • \w+ - 1+ word chars
      • [, ]{0,2} - 0 to 2 occurrences of a space or comma
    • $ - end of string
  • \W* - 0+ non-word chars
  • \K - a match reset operator that discards all text matched so far from the Group 0 (whole match) buffer
  • \w+ - 1+ word chars.
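Engines without \G and \K cannot run the pattern as written. As a rough sketch, the same validate-then-extract idea can be split into two steps with Python's standard `re` module, which is an assumption here because the answer targets PCRE. The helper name `extract_words` is hypothetical, and this is a swapped-in two-step technique rather than the single-pattern approach described above.

```python
import re

# The validation part is the body of the lookahead from the pattern above.
FORMAT = re.compile(r"\w+ - (?:\w+[, ]{0,2})+")

def extract_words(text):
    """Return every word if text has the 'word - word2, word3' shape, else []."""
    # fullmatch plays the role of the ^ ... $ anchors in the lookahead.
    if FORMAT.fullmatch(text) is None:
        return []
    # Once the format is confirmed, every run of word characters is a wanted word.
    return re.findall(r"\w+", text)

print(extract_words("word - word2, word3, word4"))
print(extract_words("word word2, word3"))
```

The trade-off is two passes over the string instead of one, in exchange for a pattern that works in engines lacking \G and \K.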

1

If you're just looking to capture each word separately, you can use this regex: (\w+)

This matches every substring of one or more word characters (letters, digits, or underscores) and ignores whitespace and punctuation. On Regex 101, with the global flag, it finds 'word', 'word2', 'word3', and 'word4' as four separate matches, each captured in group 1.

Helpful link on word characters: https://www.w3schools.com/jsref/jsref_regexp_wordchar.asp

Helpful link on quantifiers ('+' is a quantifier): https://learn.microsoft.com/en-us/dotnet/standard/base-types/quantifiers-in-regular-expressions
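As a small illustration of this approach, here is a sketch using Python's standard `re` module (an assumption, since no language is named in the thread). The helper name `all_words` is hypothetical. The second call shows the limitation raised in the comments: the pattern does not check the format at all.

```python
import re

def all_words(text):
    """Return every run of word characters, with no check of the overall format."""
    return re.findall(r"\w+", text)

# Works on the sample string from the question.
print(all_words("word - word2, word3, word4"))

# Also returns words for a string that does not follow the required format.
print(all_words("word word2; word3"))
```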

9 Comments

Yep. Thanks for the answer. But I need to parse strictly according to the format. I can't just match all words :)
@Rob, added some more explanation & helpful links.
Links for answers are not allowed on SO and will get your answer closed. Form your own answer here or delete this altogether. stackoverflow.com/help/how-to-answer
@Rob Hi, just want to get advice, not a solution. Calm down, man :)
@kpa6 Nice. This is a great place but violating Stack Overflow policy and rules is never a good thing.
1

No validation is needed except for the first word - word pair.
Using the \G anchor and a branch reset fills an array
in which every word is collected in capture group 1.

(?|(\w+)[ ]*-[ ]*(?=\w)|(?!^)\G[ ]*,?[ ]*(\w+))

https://regex101.com/r/deZq5u/1

Note that there is no need for BOS or EOS anchors, which are crutches.
This will find valid matches mid-string, as it should.

Formatted and tested

(Note: the part commented "# Optional spaces, single comma, spaces" will always
match a space, a comma, or both. Even though every piece of it is optional, it acts
as a required separator, because the preceding \w+ is greedy and leaves no word characters behind.)

 (?|                           # Branch reset
      ( \w+ )                       # (1), First word
      [ ]* - [ ]*                   # qualified with a dash,
      (?= \w )                      # then a lookahead for next word
   |                              # or,
      (?! ^ )                       # Reset \G at BOS
      \G                            # Anchor, second or more match
      [ ]* ,? [ ]*                  # Optional spaces, single comma, spaces
      ( \w+ )                       # (1), Second or more word
 )                             # End branch reset
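Python's standard `re` module has neither \G nor branch reset, so the pattern above cannot be used there directly. The following is only a sketch of how the same two branches might be emulated by hand, assuming Python. The names `FIRST`, `NEXT`, and `collect_words` are hypothetical, and the `chained` flag stands in for the \G condition.

```python
import re

# Branch 1: first word, a dash with optional spaces, then a lookahead for the next word.
FIRST = re.compile(r"(\w+)[ ]*-[ ]*(?=\w)")
# Branch 2: optional spaces, single comma, spaces, then a following word.
NEXT = re.compile(r"[ ]*,?[ ]*(\w+)")

def collect_words(text):
    """Collect words the way repeated matching of the branch-reset pattern would."""
    words = []
    pos = 0
    chained = False  # True only when pos is the end of the previous match (the \G test)
    while pos < len(text):
        m = FIRST.match(text, pos)       # branch 1 is always tried first
        if m is None and chained:
            m = NEXT.match(text, pos)    # branch 2 is allowed only right after a match
        if m is None:
            pos += 1                     # no match here: move on and break the chain
            chained = False
        else:
            words.append(m.group(1))
            pos = m.end()
            chained = True
    return words

print(collect_words("word - word2, word3, word4"))
print(collect_words("see: a - b, c"))
```

Starting with `chained` set to False mirrors the (?!^) guard, so the second branch can never fire at the beginning of the string.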
