I'm trying to detect whether a text contains characters belonging to the writing system of a language written without word boundaries. According to Wikipedia, these writing systems are the following (I have added an ISO 639 language code or, where there is none, an ISO 15924 script code):

Burmese MY
Chinese ZH
Japanese JA
S'gaw Karen KAR
Khmer KM
Lao LO
ʼPhags-pa PHAG
Pwo Karen PWO
S'gaw Karen KAR
Tai Tham LANA
Thai TH
Tibetan BO

For Chinese, I'm using a specific regex for the Han writing system:

const HAN_REGEX = /[\u2E80-\u2E99\u2E9B-\u2EF3\u2F00-\u2FD5\u3005\u3007\u3021-\u3029\u3038-\u303B\u3400-\u4DB5\u4E00-\u9FD5\uF900-\uFA6D\uFA70-\uFAD9]/;

as an equivalent to \p{Han}. An alternative for Chinese characters is to use the Unicode property escape directly:

let regexp = /\p{sc=Han}/gu;

Similarly, going by the Kanji Unicode table, the character range to detect Japanese (JA) in the text would be this one:

const KANJI_REGEX = /[\u3000-\u303f\u3040-\u309f\u30a0-\u30ff\uff00-\uff9f\u4e00-\u9faf\u3400-\u4dbf]/;

But what about the other writing systems? Is listing the character ranges the only way?

  • 1
    @SamuelLiew The OP lists 12 writing systems that share a common structure and are different from other writing systems. OP has shown that they know how to recognize each of the 12 independently, and are asking if there is a single regex that works for all, without having to specify the charset range for each one. I think the question is sufficiently focused, and am voting to reopen. Can you take another look please? Commented Nov 27, 2020 at 14:54
  • 2
    It works, I see that it has been reopened now. Commented Nov 28, 2020 at 4:23

1 Answer

This won't cover every case, because I couldn't find a good reference for recognizing scriptio continua in general, but it should get you most of the way there.

const regex = /[\p{Script_Extensions=Mymr}\p{Script_Extensions=Han}\p{Script_Extensions=Hira}\p{Script_Extensions=Kana}\p{Script_Extensions=Bopo}\p{Script=Khmer}\p{Script=Lao}\p{Script_Extensions=Phag}\p{Script=Tai_Tham}\p{Script=Thai}\p{Script=Tibetan}]/u;

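Script_Extensions matches every character that is used with a script, including shared characters (such as punctuation) whose own Script value is Common or Inherited, in addition to the script's base characters. So you're usually better off using Script_Extensions when you can.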
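To make the difference concrete, here is a small sketch (not part of the original answer) that compares the two properties on one Han ideograph and on the ideographic comma U+3001, which is shared between several East Asian scripts. The variable names are only illustrative.

```javascript
// U+6F22 is a Han ideograph: its Script value is Han.
const ideograph = "\u6F22";
// U+3001 IDEOGRAPHIC COMMA has Script=Common, but its Script_Extensions
// list includes Han (along with other East Asian scripts).
const ideographicComma = "\u3001";

const ideographByScript = /\p{Script=Han}/u.test(ideograph);
const commaByScript = /\p{Script=Han}/u.test(ideographicComma);
const commaByScriptExtensions = /\p{Script_Extensions=Han}/u.test(ideographicComma);

console.log(ideographByScript);       // true
console.log(commaByScript);           // false
console.log(commaByScriptExtensions); // true
```

If you want punctuation that is shared with a script to count as a hit, Script_Extensions is the property to use; if you only want the script's own letters, Script is stricter.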

  • \p{Script_Extensions=Mymr} should match any characters from the Myanmar script (which is what Burmese, S'gaw Karen, and Pwo Karen seem to be mapped to)
  • \p{Script_Extensions=Han} should match Han characters (Chinese hanzi and Japanese kanji)
  • \p{Script_Extensions=Bopo} should match Bopomofo characters (Hanb is Han+Bopo, but Unicode doesn't have a combined script, so this should match the remaining Chinese characters)
  • \p{Script_Extensions=Hira} should match any Hiragana characters
  • \p{Script_Extensions=Kana} should match any Katakana characters
  • \p{Script=Khmer} should match characters in Khmer script
  • \p{Script=Lao} should match characters in Lao script
  • \p{Script_Extensions=Phag} should match characters in ʼPhags-pa script
  • \p{Script=Tai_Tham} should match characters in Tai Tham script
  • \p{Script=Thai} should match characters in Thai script
  • \p{Script=Tibetan} should match characters in Tibetan script

Unicode property escapes only work when the Unicode flag is set, so remember to pass the 'u' flag.
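If you also want to know which of these scripts was found, one option is to keep a separate pattern per script. The following is only a sketch under that assumption: `SCRIPT_PATTERNS` and `detectScripts` are hypothetical names, and the property escapes are the same ones used in the combined regex above.

```javascript
// Hypothetical lookup table: one property-escape regex per script.
// None of the patterns use the g flag, so test() keeps no state between calls.
const SCRIPT_PATTERNS = {
  Myanmar: /\p{Script_Extensions=Mymr}/u,
  Han: /\p{Script_Extensions=Han}/u,
  Hiragana: /\p{Script_Extensions=Hira}/u,
  Katakana: /\p{Script_Extensions=Kana}/u,
  Bopomofo: /\p{Script_Extensions=Bopo}/u,
  Khmer: /\p{Script=Khmer}/u,
  Lao: /\p{Script=Lao}/u,
  PhagsPa: /\p{Script_Extensions=Phag}/u,
  TaiTham: /\p{Script=Tai_Tham}/u,
  Thai: /\p{Script=Thai}/u,
  Tibetan: /\p{Script=Tibetan}/u,
};

// Hypothetical helper: returns the names of all listed scripts that
// occur at least once in the given text, in table order.
function detectScripts(text) {
  return Object.entries(SCRIPT_PATTERNS)
    .filter(([, pattern]) => pattern.test(text))
    .map(([name]) => name);
}

console.log(detectScripts("Hello สวัสดี"));     // [ 'Thai' ]
console.log(detectScripts("日本語のテキスト")); // [ 'Han', 'Hiragana', 'Katakana' ]
console.log(detectScripts("hello world"));      // []
```

Because Script_Extensions also covers shared punctuation, a string containing only a character such as the ideographic comma would be reported under more than one script, so strip punctuation first if that matters for your use case.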
