I need to split a string into single words, but there are some cases which should not be split.

An example for type I string
An example for degree II string

So every combination of type | degree followed by I | II | III | IV | V should be kept as a single string.

The result of the example strings should be

['An', 'example', 'for', 'type I', 'string']
['An', 'example', 'for', 'degree II', 'string']

In my regex I have to search for type or degree, followed by a space, followed by a string of the characters I or V with a maximum length of 3. Those matches should not be split.

const regex = '/(type|degree)\s(I{1,3}|V{1})/' // <-- regEx is wrong as it is not working
const result = string.split(' ')

I'm not quite sure how to combine the regex with splitting so that all matches are treated as exceptions when splitting on the space character.

  • (I{1,3}|V{1}) means “either (from one to three I) or (exactly one V)” Commented Sep 20, 2017 at 8:19
  • You might want to support all Roman numbers - regex demo (I shortened this Roman number regex a bit). Commented Sep 20, 2017 at 8:21
  • @WiktorStribiżew That regex is great. Could you provide an answer how to use this with split()? Commented Sep 20, 2017 at 8:24
  • Wrap with (...). See jsfiddle.net/fu63Lyz5. I think you will also need to match that as whole words, hence, I added \b in the demo. Commented Sep 20, 2017 at 8:24
  • Thanks. But please have a look at the example shown in the post. I need all words split - the regex should match the exceptions, which should not themselves be split. Your JsFiddle gives an array with three items... Commented Sep 20, 2017 at 8:28

1 Answer

You may match the words type and degree followed by any Roman number, or otherwise any run of 1 or more non-whitespace chars, with:

var s = "An example for degree II string";
var rx = /\b(?:type|degree)\s+M{0,4}(?:C[MD]|D?C{0,3})(?:X[CL]|L?X{0,3})(?:I[XV]|V?I{0,3})\b|\S+/g;
console.log(s.match(rx));

I borrowed and shortened the Roman number regex from here. The pattern matches

  • \b - a word boundary
  • (?:type|degree) - a non-capturing group matching either type or degree substrings
  • \s+ - 1 or more whitespaces
  • M{0,4}(?:C[MD]|D?C{0,3})(?:X[CL]|L?X{0,3})(?:I[XV]|V?I{0,3}) - the Roman number regex
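  • \b - a trailing word boundary, so the Roman number must end at the edge of a word. Note that every part of the Roman number pattern is optional, so it can also match an empty string: in "type string" this branch matches "type " (with the trailing space), because a word boundary also sits right before "string"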
  • | - or
  • \S+ - 1 or more non-whitespace chars.
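
Since every part of the Roman number pattern is optional, the first branch can match type or degree followed only by whitespace. The sketch below is one possible way to require at least one numeral character, not part of the original answer: a lookahead (?=[MDCLXVI]) after \s+ and a (?!\w) in place of the trailing \b. The helper name splitKeepingNumerals and the constant ROMAN are hypothetical.

```javascript
// Roman number pattern taken from the answer above.
const ROMAN = "M{0,4}(?:C[MD]|D?C{0,3})(?:X[CL]|L?X{0,3})(?:I[XV]|V?I{0,3})";

// Assumption: (?=[MDCLXVI]) forces the next char to be a numeral letter and
// (?!\w) forces the numeral to end the word, so it cannot be empty.
const strictRx = new RegExp(
  "\\b(?:type|degree)\\s+(?=[MDCLXVI])" + ROMAN + "(?!\\w)|\\S+",
  "g"
);

// Hypothetical helper: returns the tokens, or [] when nothing matches.
function splitKeepingNumerals(s) {
  return s.match(strictRx) || [];
}

console.log(splitKeepingNumerals("An example for type I string"));
// [ 'An', 'example', 'for', 'type I', 'string' ]
console.log(splitKeepingNumerals("An example for degree II string"));
// [ 'An', 'example', 'for', 'degree II', 'string' ]
console.log(splitKeepingNumerals("An example for type string"));
// [ 'An', 'example', 'for', 'type', 'string' ]
```

With this variant, a word that merely starts with a numeral letter (for example "type Mix") is also split into separate tokens.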

Note that in case any symbol or punctuation char is present in front of the degree or type words, it will be matched with \S+ branch, so you need to handle those cases before applying this regex.
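
To illustrate the punctuation caveat, here is a minimal sketch of one possible pre-processing step. The helper stripPunctuation is hypothetical and simply replaces every character that is neither a word character nor whitespace with a space; whether dropping punctuation is acceptable depends on your data.

```javascript
// The answer's pattern, unchanged.
const rx = /\b(?:type|degree)\s+M{0,4}(?:C[MD]|D?C{0,3})(?:X[CL]|L?X{0,3})(?:I[XV]|V?I{0,3})\b|\S+/g;

// Hypothetical helper: turn punctuation and symbols into spaces,
// so that "(type" becomes " type" before matching.
function stripPunctuation(s) {
  return s.replace(/[^\w\s]/g, " ");
}

const raw = "An example (type I) string";

// Without pre-processing, "(type" is consumed by the \S+ branch.
console.log(raw.match(rx));
// [ 'An', 'example', '(type', 'I)', 'string' ]

// After pre-processing, the first branch can match again.
console.log(stripPunctuation(raw).match(rx));
// [ 'An', 'example', 'type I', 'string' ]
```

Note that \w only covers ASCII letters, digits and the underscore, so this sketch would also replace non-ASCII letters with spaces.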

2 Comments

Only for better understanding, as regexes are still hard for me: If I only want to keep matched strings with minimum length of 3 characters, where do I have to add this?
@user3142695: Perhaps, \S{3,} instead of \S+.
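
A short sketch of that suggestion, next to an alternative that keeps the original pattern and filters the result afterwards. The variable names rxMin3 and rxAll are made up for this example; both approaches give the same output for this input.

```javascript
const s = "An example for degree II string";

// Variant from the comment: \S{3,} instead of \S+ drops shorter words.
const rxMin3 = /\b(?:type|degree)\s+M{0,4}(?:C[MD]|D?C{0,3})(?:X[CL]|L?X{0,3})(?:I[XV]|V?I{0,3})\b|\S{3,}/g;
console.log(s.match(rxMin3));
// [ 'example', 'for', 'degree II', 'string' ]

// Alternative: match everything, then filter by length.
const rxAll = /\b(?:type|degree)\s+M{0,4}(?:C[MD]|D?C{0,3})(?:X[CL]|L?X{0,3})(?:I[XV]|V?I{0,3})\b|\S+/g;
console.log(s.match(rxAll).filter((token) => token.length >= 3));
// [ 'example', 'for', 'degree II', 'string' ]
```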
