I have some text like this:
এর জন্য বুদ্ধির (Reason) প্রয়োজন নেই, প্রয়োজন নিজের
The language is Bengali (apart from the one English word of course).
I would like to obtain a list of the Bengali words in the text (i.e. a word tokenization problem). Bengali occupies the Unicode range U+0980 to U+09FF. There is also a \p{Bengali} script property, but I don't know how to use it. Here's what I have:
import re
Pattern = re.compile(r'\[\u0980-\u09FF]+')
Words = split(Pattern, Text)
This is not working. How can I get it to work? I'd also prefer to use \p{Bengali} if possible, rather than the explicit Unicode range.
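With a raw string (r'') for regular expressions, you don't need to escape your square brackets. \[ would be translated to \\[, so the pattern looks for a literal "[" character instead of opening a character class.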
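A minimal sketch of that fix, assuming Python 3 (where the re module understands \uXXXX escapes inside a pattern). The names text, bengali_word and words are illustrative, not from the question. It also swaps split for findall: split would return the pieces between the Bengali runs, while findall returns the runs themselves.

```python
import re

text = "এর জন্য বুদ্ধির (Reason) প্রয়োজন নেই, প্রয়োজন নিজের"

# Unescaped brackets form a character class covering the Bengali block
# (U+0980 to U+09FF); "+" matches one or more such characters in a row.
bengali_word = re.compile(r'[\u0980-\u09FF]+')

# findall returns every non-overlapping match, i.e. each Bengali word.
words = bengali_word.findall(text)

print(words)
```

On \p{Bengali}: the standard re module does not support \p{...} properties. The third-party regex package (installed separately) does accept them, so with that package the pattern could be written as r'\p{Bengali}+' and used with regex.findall in the same way.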