My code fails to detect operators when they are used alongside non-English characters:

const OPERATOR_REGEX = new RegExp(
  /(?!\B"[^"|“|”]*)\b(and|or|not|exclude)(?=.*[\s])\b(?![^"|“|”]*"\B)/,
  'giu'
);

const query1 = '(Java or "化粧" or 化粧品)';
const query2 = '(Java or 化粧 or 化粧品)';

console.log(query1.split(OPERATOR_REGEX));
console.log(query2.split(OPERATOR_REGEX));

https://codepen.io/thewebtud/pen/vYraavd?editors=1111

However, the same pattern successfully detects all operators on regex101.com with the unicode flag: https://regex101.com/r/FC84BH/1

How can this be fixed for JS?

3 Comments
  • Yes, only if you target the ECMAScript 2018+ standard. It is impossible for old browsers like Safari. Commented Nov 29, 2022 at 9:37
  • I have just looked at the pattern again, and it seems you actually wanted to avoid matching the or, etc. if it is preceded by a " (not after a word char) that is followed by zero or more chars other than ", “ and ”, right? Then your pattern should have looked like /(?<!\B"[^"“”]*)\b(and|or|not|exclude)(?=.*\s)\b(?![^"“”]*"\B)/. That is, the first lookaround must be a negative lookbehind. Commented Dec 8, 2022 at 11:43
  • Why not separate parentheses and words? I would have thought the split result would look like this: [ '(', 'Java', 'or', '"化粧"', 'or', '化粧品', ')' ], or possibly like this, with the quotes removed: [ '(', 'Java', 'or', '化粧', 'or', '化粧品', ')' ]. Commented Dec 14, 2022 at 8:29
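
A minimal sketch of the tokenizing idea from the last comment, in case it helps. It is not the asker's or the answerer's code: `TOKEN_REGEX`, `OPERATORS`, `tokenize` and `isOperator` are hypothetical names, and the sketch assumes a token is a quoted phrase, a single parenthesis, or a run of other non-space characters. It uses no lookbehind and no `\b`.

```javascript
// Hypothetical tokenizer (an illustration, not code from the question):
// quoted phrase | single parenthesis | run of other non-space characters.
const TOKEN_REGEX = /["“][^"“”]*["”]|[()]|[^\s()"“”]+/gu;

// Operator names taken from the question's pattern.
const OPERATORS = new Set(['and', 'or', 'not', 'exclude']);

function tokenize(query) {
  // String.prototype.match returns null when nothing matches.
  return query.match(TOKEN_REGEX) || [];
}

function isOperator(token) {
  // Quoted tokens keep their quotes, so '"or"' is never treated as an operator.
  return OPERATORS.has(token.toLowerCase());
}

const tokens = tokenize('(Java or "化粧" or 化粧品)');
console.log(tokens);
console.log(tokens.filter(isOperator));
```

Because quoted phrases are consumed as whole tokens, an operator word inside quotes is never reported, which is what the lookarounds in the original pattern try to achieve.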

1 Answer

Keeping in mind that

  • \b (word boundary) can be written as (?:(?<=^)(?=\w)|(?<=\w)(?=$)|(?<=\W)(?=\w)|(?<=\w)(?=\W)) and
  • \B (non-word boundary) can be written as (?:(?<=^)(?=\W)|(?<=\W)(?=$)|(?<=\W)(?=\W)|(?<=\w)(?=\w))

and that a Unicode-aware \w pattern is [\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}\p{Join_Control}] (see Replace certain arabic words in text string using Javascript), here is the ECMAScript 2018+ solution:

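// Unicode-aware equivalent of \w (word character)
const w = String.raw`[\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}\p{Join_Control}]`;
// Unicode-aware equivalent of \W (non-word character)
const nw = String.raw`[^\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}\p{Join_Control}]`;
// Unicode-aware equivalent of \b (word boundary)
const uwb = String.raw`(?:(?<=^)(?=${w})|(?<=${w})(?=$)|(?<=${nw})(?=${w})|(?<=${w})(?=${nw}))`;
// Unicode-aware equivalent of \B (non-word boundary)
const unwb = String.raw`(?:(?<=^)(?=${nw})|(?<=${nw})(?=$)|(?<=${nw})(?=${nw})|(?<=${w})(?=${w}))`;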

const OPERATOR_REGEX = new RegExp(
  String.raw`(?!${unwb}"[^"“”]*)${uwb}(and|or|not|exclude)(?=.*\s)${uwb}(?![^"“”]*"${unwb})`,
  'giu'
);

const query1 = '(Java or "化粧" or 化粧品)';
const query2 = '(Java or 化粧 or 化粧品)';

console.log(query1.split(OPERATOR_REGEX));
console.log(query2.split(OPERATOR_REGEX));
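
To see why the built-in `\b` is the culprit, here is a small sketch. In JavaScript, `\w` and `\b` stay ASCII-based even with the `u` flag, so a CJK character never counts as a word character. The names `wordChar`, `asciiBoundary` and `unicodeBoundary` are mine, and the `(?<!…)…(?!…)` form is a simplified stand-in for the full boundary expressions above, valid only because the searched word starts and ends with a word character.

```javascript
// Same Unicode-aware word-character class as in the answer above.
const wordChar = String.raw`[\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}\p{Join_Control}]`;

// Built-in \b: still ASCII-based under the `u` flag, so it sees no boundary
// between a space and a CJK character.
const asciiBoundary = /\b化粧\b/u;

// Simplified Unicode-aware check (an assumption for illustration): the word
// must not be directly preceded or followed by a Unicode word character.
const unicodeBoundary = new RegExp(String.raw`(?<!${wordChar})化粧(?!${wordChar})`, 'u');

const sample = 'Java or 化粧 or 化粧品';
const asciiHit = asciiBoundary.test(sample);
const unicodeHit = unicodeBoundary.test(sample);
const insideLongerWord = unicodeBoundary.test('化粧品');

console.log(asciiHit, unicodeHit, insideLongerWord); // false true false
```

The same effect explains the failure in the question: between `"` and `化` the ASCII-based `\B` matches, so the trailing negative lookahead rejects the first `or` in query1.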

3 Comments

There are occurrences of lookbehind in this solution, which is not supported in the Safari browser. Do you have an alternative for this?
@codingrohtak No, not for Safari.
@codingrohtak Safari 16.4 and newer now support lookbehinds, too.
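
Since lookbehind support depends on the browser version, one option is to feature-detect it at runtime before building the pattern. A minimal sketch follows; `supportsLookbehind` is a hypothetical helper name, not a built-in. It relies on the fact that engines without lookbehind throw a `SyntaxError` when compiling such a pattern.

```javascript
// Hypothetical helper: returns true if this engine can compile a lookbehind.
function supportsLookbehind() {
  try {
    // The RegExp constructor is used so that the pattern is compiled at
    // runtime; a regex literal would fail at parse time on old engines.
    new RegExp('(?<=a)b');
    return true;
  } catch (e) {
    return false;
  }
}

const lookbehindAvailable = supportsLookbehind();
console.log(lookbehindAvailable);
```

When the check returns false, the code can fall back to an approach that does not rely on lookbehind.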
