28

I recently wrote a regex for my PHP code that allows only letters (including special characters) and spaces, but I'm having trouble converting it into a JavaScript-compatible regex. Here it is: /^[\s\p{L}]+$/u. The problem is the /u modifier at the end of the pattern, as JavaScript doesn't allow that flag.

How can I rewrite this, so it will work in the JavaScript as well?

Is there a way to allow only Polish characters: Ł, Ą, Ś, Ć, ...?

3
  • 3
    Perhaps this answer will be helpful here. Commented Oct 15, 2012 at 13:49
  • 1
    Are you sure you need the u flag? Have you tried removing it and testing the expression? Commented Oct 15, 2012 at 13:52
  • 1
    @cammil "u" is required so the "\p{L}" is recognized as checking for UTF-8 letters. Commented Oct 15, 2012 at 13:55

4 Answers

23

The /u modifier is for Unicode support. Support for it was added to JavaScript in ES2015.

Read http://stackoverflow.com/questions/280712/javascript-unicode to learn more about Unicode in JavaScript regexes.
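
With the u flag available, the original PHP pattern can be carried over nearly unchanged, provided the engine also supports Unicode property escapes (\p{...}), which arrived later, in ES2018. A minimal sketch under that assumption; the name lettersAndSpaces is only illustrative:

```javascript
// Assumes an ES2018+ engine: \p{L} is only recognized when the u flag is set.
const lettersAndSpaces = /^[\s\p{L}]+$/u;

console.log(lettersAndSpaces.test('Zażółć gęślą jaźń')); // true
console.log(lettersAndSpaces.test('abc123'));            // false (digits are not letters)
```

On older engines the pattern fails to compile with the u flag, which is why the explicit character lists below remain useful.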


Polish characters:

Ą \u0104
Ć \u0106
Ę \u0118
Ł \u0141
Ń \u0143
Ó \u00D3
Ś \u015A
Ź \u0179
Ż \u017B
ą \u0105
ć \u0107
ę \u0119
ł \u0142
ń \u0144
ó \u00F3
ś \u015B
ź \u017A
ż \u017C

All special Polish characters:

[\u0104\u0106\u0118\u0141\u0143\u00D3\u015A\u0179\u017B\u0105\u0107\u0119\u0142\u0144\u00F3\u015B\u017A\u017C]
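
A minimal sketch of how this class might be used for validation. Adding \s and a-zA-Z is an assumption based on the question (letters plus spaces), and the name polishLetters is only illustrative:

```javascript
// ASCII letters, whitespace, and the Polish letters listed above, written as \u escapes.
const polishLetters =
  /^[\sa-zA-Z\u0104\u0106\u0118\u0141\u0143\u00D3\u015A\u0179\u017B\u0105\u0107\u0119\u0142\u0144\u00F3\u015B\u017A\u017C]+$/;

console.log(polishLetters.test('Łódź'));               // true
console.log(polishLetters.test('Zażółć gęślą jaźń'));  // true
console.log(polishLetters.test('Äpfel'));              // false (Ä is not in the class)
```

Because every non-ASCII letter is written as an escape, the pattern does not depend on the encoding the source file is saved in.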

6 Comments

One might argue that the modifier isn't needed in any language/environment that properly handles Unicode, instead of mixing binary data and actual Unicode text in strings as PHP does.
@Joey - The PHP preg functions, which are based on PCRE, support Unicode when the /u option is appended to the regular expression.
@Scott - Polish uses the Latin script, so go with ranges: [\u0000-\u007F] = Basic Latin; [\u0080-\u00FF] = Latin-1 Supplement; [\u0100-\u017F] = Latin Extended-A; [\u0180-\u024F] = Latin Extended-B; ... which together give [\u0000-\u024F] to include all Latin characters :)
Ωmega, I know why the flag is needed in PCRE and fundamentally it's the problem that PHP doesn't have a defined character set for strings, leading to some strings being in some legacy character set, some in UTF-8, some storing even non-text binary data. Environments such as Java or .NET have it far easier in that regard, given that text is always Unicode.
This answer is one of the first results on Google when searching for "regex u flag", so you might want to update it with a preface stating that it has been defined in ES2015 and is now supported by most recent browsers :)
6

JavaScript doesn't have any notion of UTF-8 strings, so it's unlikely that you need the /u flag. (Your strings are probably already in the usual JavaScript form, one UTF-16 code-unit per "character".)

The bigger problem is that JavaScript (before ES2018) doesn't support \p{L}, nor any equivalent notation; its regexes have no awareness of Unicode character properties. (ES2018 added \p{...} property escapes, which require the /u flag.) See the answers to this StackOverflow question for some ways to approximate it.


Edited to add: If you only need to support Polish letters, then you can write /^[\sa-zA-ZĄĆĘŁŃÓŚŹŻąćęłńóśźż]+$/. The a-z and A-Z parts cover the ASCII letters, and then the remaining letters are listed out individually.
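
A short sketch of that pattern in use. The sample inputs and the name polishOnly are made up for illustration, and the source file is assumed to be saved and served as UTF-8:

```javascript
// The literal character class from the paragraph above.
const polishOnly = /^[\sa-zA-ZĄĆĘŁŃÓŚŹŻąćęłńóśźż]+$/;

console.log(polishOnly.test('Łukasz Żak')); // true
console.log(polishOnly.test('Łukasz2'));    // false (digit)
console.log(polishOnly.test('Müller'));     // false (ü is not a Polish letter)

// A literal letter and its \u escape denote the same character.
console.log(/^\u0141$/.test('Ł'));          // true
```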

6 Comments

Bad news... so maybe there is something to allow only those Polish characters: Ł, Ą, Ś, Ć, Ę instead?
Scott, if you have a small set of characters you want to allow you can always use a character class.
@Joey Yes, generally I would like to additionally allow only the special characters I mentioned above.
In a JavaScript regexp you can refer to Unicode characters like this: \u0161. For example, this will allow only printable ASCII and ć: var newtxt = txt.replace(/[^\u0107\u0020-\u007e]/g, ''). You can look up the Unicode code points for your characters here, for example: fileformat.info/info/unicode/char/107/index.htm
@ruakh: Life is full of bizarre moments. :) For /ć/ to work you MUST save the JS file in UTF-8. Other people might later use, change, and save your code with a different encoding (e.g. ISO-8859-1), in which case /ć/ will not be saved correctly and the script will not work. Using /\u0107/ avoids that kind of bug.
1

As of ES2015, /u is supported in JavaScript. See:

3 Comments

It's currently not supported by all browsers.
@PoulBak It says on the Mozilla docs it's supported by all major browsers, unless they got it wrong.
Some versions of Edge will simply crash if you use it, but I think that has been fixed, so you're probably right (no one uses IE any more).
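
If older engines are a concern, one possible approach is to probe for the flag at runtime before relying on it. A minimal sketch; supportsRegExpFlag is a hypothetical helper name, not a built-in:

```javascript
// Returns true if this engine accepts the given RegExp flag.
// Engines that do not know the flag throw a SyntaxError from the constructor.
function supportsRegExpFlag(flag) {
  try {
    new RegExp('', flag);
    return true;
  } catch (e) {
    return false;
  }
}

console.log(supportsRegExpFlag('u')); // true on ES2015+ engines
```

The constructor is used rather than a regex literal because an unsupported flag in a literal is a syntax error that prevents the whole script from parsing.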
-1

Examples

Mentioned at: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/RegExp/unicode#description

  • /u allows you to use \u{...} code point escapes in the regexp literal. E.g. given that the character 'a' has ASCII and Unicode code point 0x61, this matches with /u:

    'a'.match(/\u{61}/u) !== null
    

    But it does not match without /u:

    'a'.match(/\u{61}/) === null
    

    because in that case it gets interpreted as:

    • \u: nonexistent character class (e.g. \s would be spaces, but u doesn't have one). So it just means 'u'.
    • {61}: repeat 61 times

    so it would instead match:

    'u'.repeat(61).match(/\u{61}/) !== null
    

    Note that this case only matters when using a RegExp literal; if you pass a string to new RegExp explicitly, the string literal resolves the \u{61} escape itself before the regex is compiled, so no flag is needed:

    'a'.match(new RegExp('\u{61}')) !== null
    
  • One case we can't work around with strings is surrogate pairs. JavaScript is pretty much hardcoded to UTF-16, so if we use a code point such as U+1F604 Smiling Face with Open Mouth and Smiling Eyes, then /u matters:

    '\u{1F604}'.length === 2
    '\u{1F604}'.match(new RegExp('\u{1F604}')) !== null
    '\u{1F604}'.match(new RegExp('\uD83D')) !== null
    '\u{1F604}'.match(new RegExp('\uD83D', 'u')) === null
    

    This is because in UTF-16, U+1F604 is encoded as 0xD83D 0xDE04, and so without /u, 0xD83D can match just half of \u{1F604}. The horrors!

Tested on Node.js v20.10.0, Ubuntu 24.04.
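
The surrogate-pair behaviour described above also shows up with the dot, which matches one UTF-16 code unit without /u and one whole code point with it. A small sketch; the variable name smiley is only illustrative:

```javascript
const smiley = '\u{1F604}'; // one code point, two UTF-16 code units

console.log(/^.$/.test(smiley));   // false: without /u the dot matches a single code unit
console.log(/^..$/.test(smiley));  // true: the two surrogate halves match separately
console.log(/^.$/u.test(smiley));  // true: with /u the dot matches the whole code point
console.log([...smiley].length);   // 1: string iteration is also code point based
```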
