0

I have some text like this:

এর জন্য বুদ্ধির (Reason) প্রয়োজন নেই, প্রয়োজন নিজের

The language is Bengali (apart from the one English word of course).

I would like to obtain a list of the Bengali words in the text (i.e. a word tokenization problem). The Bengali Unicode block covers U+0980 to U+09FF. There is also a \p{Bengali} script property, but I don't know how to use it. Here's what I have:

import re
Pattern = re.compile(r'\[\u0980-\u09FF]+')
Words = split(Pattern, Text)

This does not work. How can I fix it? I'd also prefer to use \p{Bengali} if possible, rather than the explicit Unicode range.

4
  • Just a check, are you on Python 3.x or 2.x? Makes a big difference with Unicode. Commented Apr 11, 2012 at 9:36
  • When using raw literal strings (r'') for regular expressions, you don't need to escape your square brackets. Commented Apr 11, 2012 at 9:48
  • Yes exactly. Just found that out. Thanks. Commented Apr 11, 2012 at 9:52
  • @MattH: This has nothing to do with raw strings. In normal strings, you shouldn't escape the brackets either since \[ would be translated to \\[. Commented Apr 11, 2012 at 9:58

3 Answers

4

Python's built-in re module doesn't yet understand Unicode script properties like \p{...}.

Your version should work once you remove the backslash that escapes the opening bracket and use findall() instead of split() (you also wrote split() rather than re.split(), but I guess that was just a typo).
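To make the two problems concrete, here is a minimal sketch. It is written for Python 3 purely for illustration (where every str is already Unicode), and the names broken, fixed, no_matches, separators and words are made up for this example:

```python
import re

text = "এর জন্য বুদ্ধির (Reason) প্রয়োজন নেই, প্রয়োজন নিজের"

# The question's pattern: "\[" is a literal bracket, so no character class is formed.
broken = re.compile(r'\[\u0980-\u09FF]+')
# Without the backslash, the brackets delimit a class covering the Bengali block.
fixed = re.compile(r'[\u0980-\u09FF]+')

no_matches = broken.findall(text)  # the text contains no literal "[" sequence to match
separators = fixed.split(text)     # split() returns what lies BETWEEN the Bengali runs
words = fixed.findall(text)        # findall() returns the Bengali runs themselves

print(no_matches)
print(separators)
print(words)
```

In this sketch, separators holds the spaces, the comma and " (Reason) ", which is the opposite of what the question asks for, while words holds only the Bengali tokens.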

Also, since you're not using Python 3 as you stated in your recent comment, you probably need to use the re.UNICODE option and make sure that text is in fact a Unicode string.

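import re
# Python 2 only: the u prefix makes the pattern a Unicode string, so \u0980 and \u09FF are real code points
pattern = re.compile(ur'[\u0980-\u09FF]+', re.UNICODE)
# text must be a unicode object here, not a byte string
words = re.findall(pattern, text)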

5 Comments

CapWords is generally used for class names in Python; PEP 8 recommends lowercase_with_underscores for local variables.
@Lattyware: Thanks, I was just about to comment on that.
I tried: SimpleWordTokenizer = re.compile(r'[\u0980-\u09FF]+', re.UNICODE) Temp = re.findall(SimpleWordTokenizer, text) Unfortunately it gives me the English words instead of the Bengali ones!
Ah! I got it to work by using ur'[\u0980-\u09FF]+'. Apparently that 'u' in front made the difference.
Ah, of course, I overlooked that. Great to hear it works now.
0

You can use the third-party regex module, which you can install with pip:

pip3 install regex

and then use the \p{ScriptName} pattern to match characters from the script you're looking for:

import regex
t = "এর জন্য বুদ্ধির (Reason) প্রয়োজন নেই, প্রয়োজন নিজের"
t = regex.findall(r"[\p{Bengali}]+", t)
print(t)

More on the regex module here
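If installing a third-party module is not an option, script membership can be approximated with the standard library alone. Note that this is a different technique from the one in this answer: the sketch below treats a character as Bengali when its Unicode name starts with "BENGALI". That is an assumption that holds for the named characters in the U+0980 to U+09FF block, but it is not the same thing as the Script property that \p{Bengali} tests. The helper names is_bengali and bengali_words are hypothetical.

```python
import unicodedata
from itertools import groupby


def is_bengali(ch):
    # Assumption: a Unicode name starting with "BENGALI" marks a Bengali character.
    # Unnamed (unassigned) code points fall back to '' and are rejected.
    return unicodedata.name(ch, '').startswith('BENGALI')


def bengali_words(text):
    # Group consecutive characters by the is_bengali flag and keep only the Bengali runs.
    return [''.join(run) for keep, run in groupby(text, key=is_bengali) if keep]


sample = "এর জন্য বুদ্ধির (Reason) প্রয়োজন নেই, প্রয়োজন নিজের"
print(bengali_words(sample))
```

Like the explicit range, this splits a word at any character that is not itself Bengali, such as a zero-width joiner, so it is only a rough stand-in for a real script test.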

Comments

-1

You can just split on whitespace:

>>> import re
>>> x = 'এর জন্য বুদ্ধির (Reason) প্রয়োজন নেই, প্রয়োজন নিজের'
>>> re.split(r'\s', x)
['\xe0\xa6\x8f\xe0\xa6\xb0', '\xe0\xa6\x9c\xe0\xa6\xa8\xe0\xa7\x8d\xe0\xa6\xaf', '\xe0\xa6\xac\xe0\xa7\x81\xe0\xa6\xa6\xe0\xa7\x8d\xe0\xa6\xa7\xe0\xa6\xbf\xe0\xa6\xb0', '(Reason)', '\xe0\xa6\xaa\xe0\xa7\x8d\xe0\xa6\xb0\xe0\xa6\xaf\xe0\xa6\xbc\xe0\xa7\x8b\xe0\xa6\x9c\xe0\xa6\xa8', '\xe0\xa6\xa8\xe0\xa7\x87\xe0\xa6\x87,', '\xe0\xa6\xaa\xe0\xa7\x8d\xe0\xa6\xb0\xe0\xa6\xaf\xe0\xa6\xbc\xe0\xa7\x8b\xe0\xa6\x9c\xe0\xa6\xa8', '\xe0\xa6\xa8\xe0\xa6\xbf\xe0\xa6\x9c\xe0\xa7\x87\xe0\xa6\xb0']

3 Comments

This doesn't achieve what is wanted, and even for plain whitespace splitting a regex is overkill; why not just use x.split()?
I was thinking of splitting 'by Bengali' as described, since it would automatically get rid of all English characters, punctuation marks, etc.
True, didn't read it thoroughly enough and now for some reason I can't delete my answer.
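The point made in these comments can be checked with a short sketch (Python 3, with illustrative variable names): splitting on whitespace keeps the English word and the punctuation, and stripping punctuation afterwards still leaves the English word in the list.

```python
import string

sample = "এর জন্য বুদ্ধির (Reason) প্রয়োজন নেই, প্রয়োজন নিজের"

# Plain whitespace splitting: '(Reason)' survives and one token keeps its trailing comma.
tokens = sample.split()

# Stripping ASCII punctuation removes the comma and the parentheses,
# but 'Reason' itself is still a token.
stripped = [tok.strip(string.punctuation) for tok in tokens]

print(tokens)
print(stripped)
```

Matching on the Bengali range instead avoids both problems in a single step.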
