Regex in JavaScript to match writing systems without word boundaries

Question

I'm trying to detect whether a text contains characters from the writing system of a language that does not mark word boundaries. According to Wikipedia, these writing systems are the following (I have added the ISO 639-2 or 639-3 code):

Burmese MY
Chinese ZH
Japanese JA
S'gaw Karen KAR
Khmer KM
Lao LO
ʼPhags-pa PHAG
Pwo Karen PWO
S'gaw Karen KAR
Tai Tham LANA
Thai TH
Tibetan BO

For Chinese, I'm using a specific regex for the Han writing system:

HAN_REGEX = /[\u2E80-\u2E99\u2E9B-\u2EF3\u2F00-\u2FD5\u3005\u3007\u3021-\u3029\u3038-\u303B\u3400-\u4DB5\u4E00-\u9FD5\uF900-\uFA6D\uFA70-\uFAD9]/;

as an equivalent to \p{Han}. An alternative for Chinese characters is to use a Unicode property escape directly:

let regexp = /\p{sc=Han}/gu;

Similarly, going by the Kanji Unicode table, the character ranges to detect Japanese (JA) in the text would be:

KANJI_REGEX = /[\u3000-\u303f\u3040-\u309f\u30a0-\u30ff\uff00-\uff9f\u4e00-\u9faf\u3400-\u4dbf]/

But what about the other writing systems? Is listing character ranges the only way?

@SamuelLiew The OP lists 12 writing systems that share a common structure and are different from other writing systems. OP has shown that they know how to recognize each of the 12 independently, and are asking if there is a single regex that works for all, without having to specify the charset range for each one. I think the question is sufficiently focused, and am voting to reopen. Can you take another look please?
– cigien, Commented Nov 27, 2020 at 14:54

Amy Shackles · Accepted Answer · 2020-12-06 04:20:57Z

This won't cover every case, because I couldn't find a good reference for how to recognize scriptio continua, but it should get you most of the way there.

let regex = /[\p{Script_Extensions=Mymr}\p{Script_Extensions=Han}\p{Script_Extensions=Hira}\p{Script_Extensions=Kana}\p{Script_Extensions=Bopo}\p{Script=Khmer}\p{Script=Lao}\p{Script_Extensions=Phag}\p{Script=Tai_Tham}\p{Script=Thai}\p{Script=Tibetan}]/u;

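Script_Extensions matches every character whose Script_Extensions property lists the script. That covers the script's own characters plus characters shared between scripts (for example, marks and punctuation that the plain Script property classifies as Common or Inherited), so you're usually better off using Script_Extensions when you can.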
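
To make the difference concrete, here is a minimal sketch comparing the two properties on one shared character. It assumes U+30FC (the Katakana-Hiragana prolonged sound mark) as the test case; the variable names are only illustrative.

```javascript
// U+30FC is used in both Hiragana and Katakana text, so the Script property
// classifies it as Common, while Script_Extensions lists Hiragana and Katakana.
const prolongedSoundMark = "\u30FC";

const byScript = /\p{Script=Hiragana}/u;
const byScriptExtensions = /\p{Script_Extensions=Hiragana}/u;

// An ordinary Hiragana letter matches either way.
const letterMatchesBoth = byScript.test("あ") && byScriptExtensions.test("あ");

// The shared mark is only picked up through Script_Extensions.
const scriptMatches = byScript.test(prolongedSoundMark);
const scriptExtensionsMatches = byScriptExtensions.test(prolongedSoundMark);

console.log(letterMatchesBoth, scriptMatches, scriptExtensionsMatches);
```

One consequence worth keeping in mind: a Script_Extensions pattern can also match shared marks and punctuation, which may or may not be what you want when you only care about letters.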

\p{Script_Extensions=Mymr} should match any characters from the Myanmar script (which is what the Burmese, S'gaw Karen, and Pwo Karen seem to be mapped to)
\p{Script_Extensions=Han} should match Han or Kanji characters
\p{Script_Extensions=Bopo} should match Bopomofo characters (Hanb is Han + Bopomofo, but Unicode doesn't have a combined script, so this should match the remaining characters used in Chinese text)
\p{Script_Extensions=Hira} should match any Hiragana characters
\p{Script_Extensions=Kana} should match any Katakana characters
\p{Script=Khmer} should match characters in Khmer script
\p{Script=Lao} should match characters in Lao script
\p{Script_Extensions=Phag} should match characters in 'Phags-pa script
\p{Script=Tai_Tham} should match characters in Tai Tham script
\p{Script=Thai} should match characters in Thai script
\p{Script=Tibetan} should match characters in Tibetan script
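
If you also want to know which of these scripts turned up, the same property escapes can be kept as separate patterns. This is only a sketch: the SCRIPT_PATTERNS table and the detectScripts helper are hypothetical names, not part of any library.

```javascript
// One pattern per script, using the same properties as the combined regex.
// No "g" flag, so test() keeps no state between calls.
const SCRIPT_PATTERNS = {
  Myanmar: /\p{Script_Extensions=Mymr}/u,
  Han: /\p{Script_Extensions=Han}/u,
  Hiragana: /\p{Script_Extensions=Hira}/u,
  Katakana: /\p{Script_Extensions=Kana}/u,
  Bopomofo: /\p{Script_Extensions=Bopo}/u,
  Khmer: /\p{Script=Khmer}/u,
  Lao: /\p{Script=Lao}/u,
  Phags_Pa: /\p{Script_Extensions=Phag}/u,
  Tai_Tham: /\p{Script=Tai_Tham}/u,
  Thai: /\p{Script=Thai}/u,
  Tibetan: /\p{Script=Tibetan}/u,
};

// Hypothetical helper: returns the names of the listed scripts found in the text.
function detectScripts(text) {
  return Object.entries(SCRIPT_PATTERNS)
    .filter(([, pattern]) => pattern.test(text))
    .map(([name]) => name);
}

console.log(detectScripts("Hello สวัสดี"));
console.log(detectScripts("日本語のテキスト"));
console.log(detectScripts("plain English"));
```

An empty result means none of the listed scripts was found, so the text can probably be split on whitespace as usual.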

And since Unicode property escapes can't be used without the Unicode flag set, it's important to remember to pass the 'u' flag.
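
As a quick illustration of why the flag matters, the sketch below runs the same pattern with and without 'u'. It assumes an engine with the usual web-compatibility regex syntax (browsers, Node.js), where the flagless form is accepted silently instead of throwing.

```javascript
const withFlag = /\p{Script=Thai}/u;
// Without "u", \p is just an escaped letter "p" and the braces are literal text.
const withoutFlag = /\p{Script=Thai}/;

const thaiText = "ไทย";

const matchesWithFlag = withFlag.test(thaiText);       // true
const matchesWithoutFlag = withoutFlag.test(thaiText); // false
// The flagless pattern only matches the literal characters it is written with.
const matchesLiteralText = withoutFlag.test("p{Script=Thai}"); // true

console.log(matchesWithFlag, matchesWithoutFlag, matchesLiteralText);
```

Because the flagless version fails silently, a forgotten flag tends to show up as "the regex never matches" and not as an error.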
