Splitting text by a Unicode script in Python

Question

I have some text like this:

এর জন্য বুদ্ধির (Reason) প্রয়োজন নেই, প্রয়োজন নিজের

The language is Bengali (apart from the one English word of course).

I would like to obtain a list of the Bengali words in the text (i.e. a word tokenization problem). The Bengali block covers the Unicode range U+0980 to U+09FF. There is also a \p{Bengali} script property, but I don't know how to use it. Here's what I have:

import re
Pattern = re.compile(r'\[\u0980-\u09FF]+')
Words = split(Pattern, Text)

This is not working. How can I get it to work? I'd also prefer to use \p{Bengali} if possible, rather than the explicit Unicode range.

Just a check, are you on Python 3.x or 2.x? Makes a big difference with Unicode.
– Gareth Latty, Commented Apr 11, 2012 at 9:36
When using raw literal strings (r'') for regular expressions, you don't need to escape your square brackets.
– MattH, Commented Apr 11, 2012 at 9:48
@MattH: This has nothing to do with raw strings. In normal strings, you shouldn't escape the brackets either since \[ would be translated to \\[.
– Tim Pietzcker, Commented Apr 11, 2012 at 9:58

Tim Pietzcker · Accepted Answer · 2012-04-11 09:47:45Z

4

Python's built-in re module doesn't yet support Unicode script properties like \p{...}.

Your version should work once you remove the backslash that escapes the opening bracket and use findall() instead of split() (you wrote split() rather than re.split(), but I guess that was just a typo).

Also, since you stated in your recent comment that you're not using Python 3, you probably need to use the re.UNICODE option and make sure that text is in fact a Unicode string.

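import re
# Python 2: the u prefix makes the pattern a Unicode string, so \u0980 and \u09FF become real code points
pattern = re.compile(ur'[\u0980-\u09FF]+', re.UNICODE)
# text must be a unicode object (e.g. u'...'), not a byte string
words = re.findall(pattern, text)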
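
For readers on Python 3, here is a minimal sketch of the same approach. It assumes Python 3.3 or later, where every str is Unicode and the re module interprets \uXXXX escapes itself, so neither the u prefix nor re.UNICODE is needed. The helper name bengali_words and the constant BENGALI_RUN are hypothetical names chosen for illustration, not part of the answer above.

```python
import re

# Bengali block, U+0980 to U+09FF. In Python 3 the re module itself
# understands \uXXXX escapes, so a plain raw string is enough.
BENGALI_RUN = re.compile(r'[\u0980-\u09FF]+')


def bengali_words(text):
    """Return every maximal run of Bengali-block characters in text."""
    # Hypothetical helper name, used only for this illustration.
    return BENGALI_RUN.findall(text)


sample = "এর জন্য বুদ্ধির (Reason) প্রয়োজন নেই, প্রয়োজন নিজের"
words = bengali_words(sample)
print(words)
```

Because findall() returns only the matched runs, the English word, the parentheses and the comma are dropped without any extra filtering.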

edited Apr 11, 2012 at 9:47

answered Apr 11, 2012 at 9:37

Tim Pietzcker


5 Comments

Gareth Latty Over a year ago

CapWords is generally used for class names in Python; PEP 8 recommends lowercase_with_underscores for local variables.

Tim Pietzcker Over a year ago

@Lattyware: Thanks, I was just about to comment on that.

Velvet Ghost Over a year ago

I tried:
SimpleWordTokenizer = re.compile(r'[\u0980-\u09FF]+', re.UNICODE)
Temp = re.findall(SimpleWordTokenizer, text)
Unfortunately it gives me the English words instead of the Bengali ones!

Velvet Ghost Over a year ago

Ah! I got it to work by using ur'[\u0980-\u09FF]+'. Apparently that 'u' in front made the difference.

Tim Pietzcker Over a year ago

Ah, of course, I overlooked that. Great to hear it works now.
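
A likely explanation for the English matches, offered as my reading rather than something stated in the thread: in a Python 2 byte string, r'[\u0980-\u09FF]+' is not decoded to code points, and the regex parser appears to treat the unknown escape \u as the plain letter u. The class then contains the range from "0" to "u", which covers ASCII digits, uppercase letters and the lowercase letters a to u. The sketch below emulates that misreading in Python 3 by writing the letter u directly; the variable names are made up for the illustration.

```python
import re

text = "এর জন্য বুদ্ধির (Reason) প্রয়োজন নেই, প্রয়োজন নিজের"

# Emulation (assumption): how the character class reads when \u is not a
# code point escape but just the literal letter "u".
# The class then holds u, 0, 9, 8, the range "0" to "u", and F.
misread = re.compile(r'[u0980-u09FF]+')

# The range "0" to "u" spans ASCII digits, uppercase letters and a to u,
# so only the Latin word can match; no Bengali character is in the class.
latin_only = misread.findall(text)
print(latin_only)
```

With the u prefix, the same pattern text becomes a genuine range between two Bengali code points, which is why the prefixed version behaves as intended.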

Bahman Eslami · Answer · 2020-09-24 21:57:24Z

0

You can use the third-party regex library, which you can install with pip:

pip3 install regex

and use the \p{ScriptName} pattern to find the script you're looking for:

import regex
t = "এর জন্য বুদ্ধির (Reason) প্রয়োজন নেই, প্রয়োজন নিজের"
t = regex.findall(r"[\p{Bengali}]+", t)
print(t)
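
If adding a third-party dependency is not an option, a rough standard-library approximation is sketched below. It swaps in a different technique: instead of a script property, it checks whether each character's Unicode name starts with "BENGALI", then groups consecutive hits. This is an assumption-based stand-in for \p{Bengali}, not an exact equivalent, and the function names is_bengali and bengali_runs are hypothetical.

```python
import unicodedata
from itertools import groupby


def is_bengali(ch):
    """Approximate script test: the character's Unicode name starts with BENGALI."""
    # Assumption: name prefixes are a good enough proxy for the Bengali script.
    # Unnamed or unassigned code points fall back to "" and are rejected.
    return unicodedata.name(ch, "").startswith("BENGALI")


def bengali_runs(text):
    """Return maximal runs of consecutive Bengali characters, in order."""
    return ["".join(group) for flag, group in groupby(text, key=is_bengali) if flag]


t = "এর জন্য বুদ্ধির (Reason) প্রয়োজন নেই, প্রয়োজন নিজের"
print(bengali_runs(t))
```

This avoids hard-coding the code point range, at the cost of one name lookup per character.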

Not_a_Golfer · Answer · 2012-04-11 09:37:02Z

-1

You can just split on whitespace:

>>> import re
>>> x = 'এর জন্য বুদ্ধির (Reason) প্রয়োজন নেই, প্রয়োজন নিজের'
>>> re.split('\s', x)
['\xe0\xa6\x8f\xe0\xa6\xb0', '\xe0\xa6\x9c\xe0\xa6\xa8\xe0\xa7\x8d\xe0\xa6\xaf', '\xe0\xa6\xac\xe0\xa7\x81\xe0\xa6\xa6\xe0\xa7\x8d\xe0\xa6\xa7\xe0\xa6\xbf\xe0\xa6\xb0', '(Reason)', '\xe0\xa6\xaa\xe0\xa7\x8d\xe0\xa6\xb0\xe0\xa6\xaf\xe0\xa6\xbc\xe0\xa7\x8b\xe0\xa6\x9c\xe0\xa6\xa8', '\xe0\xa6\xa8\xe0\xa7\x87\xe0\xa6\x87,', '\xe0\xa6\xaa\xe0\xa7\x8d\xe0\xa6\xb0\xe0\xa6\xaf\xe0\xa6\xbc\xe0\xa7\x8b\xe0\xa6\x9c\xe0\xa6\xa8', '\xe0\xa6\xa8\xe0\xa6\xbf\xe0\xa6\x9c\xe0\xa7\x87\xe0\xa6\xb0']

answered Apr 11, 2012 at 9:37

Not_a_Golfer

3 Comments

Gareth Latty Over a year ago

This doesn't achieve what is wanted, and for plain whitespace splitting a regex is overkill; why not just use x.split()?

Velvet Ghost Over a year ago

I was thinking of splitting 'by Bengali' as described, since it would automatically get rid of all English characters, punctuation marks, etc.

Not_a_Golfer Over a year ago

True, didn't read it thoroughly enough and now for some reason I can't delete my answer.
