Unicode flag not working for RegEx in JavaScript

Question

My code fails to detect operators when the query also contains non-English characters:

const OPERATOR_REGEX = new RegExp(
  /(?!\B"[^"|“|”]*)\b(and|or|not|exclude)(?=.*[\s])\b(?![^"|“|”]*"\B)/,
  'giu'
);

const query1 = '(Java or "化粧" or 化粧品)';
const query2 = '(Java or 化粧 or 化粧品)';

console.log(query1.split(OPERATOR_REGEX));
console.log(query2.split(OPERATOR_REGEX));

https://codepen.io/thewebtud/pen/vYraavd?editors=1111

The same pattern, however, detects all operators on regex101.com with the unicode flag: https://regex101.com/r/FC84BH/1

How can this be fixed for JS?

Yes, but only if you target the ECMAScript 2018+ standard. It is impossible for old browsers like Safari.
– Wiktor Stribiżew, Commented Nov 29, 2022 at 9:37
I have just looked at the pattern again, and it seems you actually wanted to avoid matching or, etc. when it is preceded by a " (one that does not follow a word char) followed by zero or more chars other than ", “ and ”, right? Then your pattern should look like /(?<!\B"[^"“”]*)\b(and|or|not|exclude)(?=.*\s)\b(?![^"“”]*"\B)/. That is, the first lookaround must be a negative lookbehind.
– Wiktor Stribiżew, Commented Dec 8, 2022 at 11:43
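
A minimal sketch of the lookbehind variant from the comment above, for readers who want to see it split a query. The sample query and the names LOOKBEHIND_OPERATOR_REGEX and asciiQuery are illustrative assumptions, not part of the original thread. The query is deliberately ASCII-only, because \b and \B in this pattern are still ASCII-based; it also needs an engine with lookbehind support (ECMAScript 2018+).

```javascript
// Pattern from the comment above: the first lookaround is a negative lookbehind.
// \b and \B are still ASCII-only here, so this sketch uses an ASCII-only query.
const LOOKBEHIND_OPERATOR_REGEX =
  /(?<!\B"[^"“”]*)\b(and|or|not|exclude)(?=.*\s)\b(?![^"“”]*"\B)/giu;

// Hypothetical sample query, not from the original question.
const asciiQuery = 'Java or "cats and dogs" or Rust';

// split() keeps the captured operators in the result array.
const parts = asciiQuery.split(LOOKBEHIND_OPERATOR_REGEX);
console.log(parts);
// [ 'Java ', 'or', ' "cats and dogs" ', 'or', ' Rust' ]
```

The "and" inside the quoted phrase is left alone, while both "or" operators outside the quotes are treated as split points.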
Why not separate parentheses and words? I would have thought the split result looks like this: [ '(', 'Java', 'or', '"化粧"', 'or', '化粧品', ')'], or possibly this with quotes removed: [ '(', 'Java', 'or', '化粧', 'or', '化粧品', ')']
– Peter Thoeny, Commented Dec 14, 2022 at 8:29
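
The tokenizing idea in the comment above could be sketched roughly as follows. This is one possible approach, not something posted in the thread: TOKEN_REGEX, tokenize, OPERATORS and isOperator are hypothetical names, and the sketch assumes phrases are delimited by straight double quotes only. It uses neither \b nor lookbehind, so it does not depend on ECMAScript 2018 lookbehind support.

```javascript
// Hypothetical helper (not from the original post): split a query into
// quoted phrases, parentheses, and bare words.
// Alternation order matters: quoted phrases are tried first.
const TOKEN_REGEX = /"[^"“”]*"|[()]|[^\s()"]+/gu;

function tokenize(query) {
  // match() with the g flag returns all matches, or null if there are none.
  return query.match(TOKEN_REGEX) || [];
}

// Operators are recognized after tokenizing, by plain string comparison.
const OPERATORS = new Set(['and', 'or', 'not', 'exclude']);

function isOperator(token) {
  return OPERATORS.has(token.toLowerCase());
}

const tokens = tokenize('(Java or "化粧" or 化粧品)');
console.log(tokens);
// [ '(', 'Java', 'or', '"化粧"', 'or', '化粧品', ')' ]
console.log(tokens.filter(isOperator));
// [ 'or', 'or' ]
```

Because a quoted phrase is consumed as a single token, an operator word inside quotes never reaches isOperator on its own.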

Wiktor Stribiżew · Accepted Answer · 2022-11-29 09:46:04Z


Keeping in mind that

\b (word boundary) can be written as (?:(?<=^)(?=\w)|(?<=\w)(?=$)|(?<=\W)(?=\w)|(?<=\w)(?=\W)) and
\B (non-word boundary) can be written as (?:(?<=^)(?=\W)|(?<=\W)(?=$)|(?<=\W)(?=\W)|(?<=\w)(?=\w))

and that a Unicode-aware \w pattern is [\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}\p{Join_Control}] (see "Replace certain arabic words in text string using Javascript"), here is the ECMAScript 2018+ solution:

const w = String.raw`[\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}\p{Join_Control}]`;
const nw = String.raw`[^\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}\p{Join_Control}]`;
const uwb = String.raw`(?:(?<=^)(?=${w})|(?<=${w})(?=$)|(?<=${nw})(?=${w})|(?<=${w})(?=${nw}))`;
const unwb = String.raw`(?:(?<=^)(?=${nw})|(?<=${nw})(?=$)|(?<=${nw})(?=${nw})|(?<=${w})(?=${w}))`;

const OPERATOR_REGEX = new RegExp(
  String.raw`(?!${unwb}"[^"“”]*)${uwb}(and|or|not|exclude)(?=.*\s)${uwb}(?![^"“”]*"${unwb})`,
  'giu'
);

const query1 = '(Java or "化粧" or 化粧品)';
const query2 = '(Java or 化粧 or 化粧品)';

console.log(query1.split(OPERATOR_REGEX));
console.log(query2.split(OPERATOR_REGEX));
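
To make the reasoning behind the custom boundaries concrete, here is a small sketch showing that \w, and with it \b and \B, stays ASCII-only even when the u flag is set. The names asciiWord, WORD_CHAR and unicodeWord are illustrative; the character class is the one used in the answer. The last two checks show what appears to be the effect on the original pattern: with ASCII \B, a quote in front of 化粧 is indistinguishable from a closing quote.

```javascript
// \w is [A-Za-z0-9_] even with the u flag, so CJK characters are "non-word".
const asciiWord = /^\w$/u;
console.log(asciiWord.test('a'));  // true
console.log(asciiWord.test('化')); // false

// Unicode-aware word class, as used in the answer above.
const WORD_CHAR = String.raw`[\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}\p{Join_Control}]`;
const unicodeWord = new RegExp(`^${WORD_CHAR}$`, 'u');
console.log(unicodeWord.test('化')); // true

// "\B means: a quote that is NOT followed by a word boundary.
// Before an ASCII letter there is a boundary, so the quote reads as an opening quote.
console.log(/"\B/u.test('"abc'));  // false
// Before 化 there is no ASCII boundary, so the same quote looks like a closing quote.
console.log(/"\B/u.test('"化粧')); // true
```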


3 Comments

codingrohtak Over a year ago

This solution contains lookbehinds, which are not supported in the Safari browser. Do you have an alternative for this?

Wiktor Stribiżew Over a year ago

@codingrohtak No, not for Safari.

Wiktor Stribiżew Over a year ago

@codingrohtak Safari 16.4 and newer now support lookbehinds, too.
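
Since support depends on the browser version, one option is to detect it at runtime before building the pattern. The following is a minimal sketch under that assumption; supportsPattern and the two feature constants are hypothetical names, not part of the answer. The probe is built from a string with the RegExp constructor so that an unsupported syntax raises a catchable SyntaxError instead of failing when the script is parsed.

```javascript
// Hypothetical helper: returns true if the engine can compile the given pattern.
function supportsPattern(source, flags) {
  try {
    new RegExp(source, flags);
    return true;
  } catch (err) {
    // Unsupported syntax throws a SyntaxError at construction time.
    return false;
  }
}

// Lookbehind assertions (ECMAScript 2018).
const hasLookbehind = supportsPattern('(?<=a)b', '');
// Unicode property escapes with the u flag (ECMAScript 2018).
const hasPropertyEscapes = supportsPattern('\\p{Alphabetic}', 'u');

if (hasLookbehind && hasPropertyEscapes) {
  console.log('Unicode-aware boundary pattern can be used');
} else {
  console.log('Fall back to a pattern without lookbehind');
}
```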
