Python and Unicode Blocks for regex

Question

Coming from the land of Perl, I can do something like the following to test whether a string contains characters from a particular Unicode block:

# test if string has any katakana script characters
my $japanese = "カタカナ";
if ($japanese =~ /\p{InKatakana}/) {
   print "string has katakana"
}

I've read that Python does not support Unicode blocks (is that true?), so what's the best way to implement this manually? For example, the Unicode block behind {InKatakana} above should span U+30A0…U+30FF. How can I test for that range in Python? Are there any other recommended solutions?

I would prefer not to go with an external wrapper like Ponyguruma to limit the number of dependencies for roll-out/maintenance.

Ignacio Vazquez-Abrams · Accepted Answer · 2010-06-29 22:40:41Z

8

>>> re.search(u'[\u30a0-\u30ff]', u'カタカナ')
<_sre.SRE_Match object at 0x7fa0dbb62578>
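To get closer to the named-block feel of Perl's \p{InKatakana}, the same range trick can be wrapped in a small lookup table. This is only a sketch for Python 3: the names BLOCKS, block_pattern and in_block are made up for illustration and are not part of the re module. The Katakana range is the one quoted in the question, and the Hiragana range (U+3040 to U+309F) is added as a second example.

```python
import re

# Hypothetical lookup table: block name -> (first, last) code point.
# The Katakana range is the one quoted in the question.
BLOCKS = {
    "Katakana": (0x30A0, 0x30FF),
    "Hiragana": (0x3040, 0x309F),
}

def block_pattern(name):
    """Compile a character class matching any code point in the named block."""
    first, last = BLOCKS[name]
    return re.compile("[{}-{}]".format(chr(first), chr(last)))

def in_block(text, name):
    """Return True if text contains at least one character from the block."""
    return block_pattern(name).search(text) is not None
```

With this in place, in_block("カタカナ", "Katakana") plays the role of the Perl test above. More blocks can be added to the table by hand, which avoids any external dependency.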

answered Jun 29, 2010 at 22:40

Ignacio Vazquez-Abrams


Reiner Gerecke · Accepted Answer · 2011-02-08 11:29:50Z

2

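As Ignacio's answer shows, a character range in a regular expression does the job. Don't forget to import re first. Note that this range only matches full-width katakana; half-width katakana (U+FF65 to U+FF9F) fall outside it.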

import re
re.search(u'[\u30a0-\u30ff]', u'カタカナ')

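Or you might already have a UTF-8 encoded byte string on hand, in which case decode it to Unicode before searching.

import re
x = u"カタカナ".encode('utf-8')  # a UTF-8 encoded byte string
re.search(u'[\u30a0-\u30ff]', x.decode('utf-8'))  # decode to Unicode before matching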
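Since the range above only covers full-width katakana, here is a rough sketch of how the character class could be widened to catch half-width katakana too. It assumes Python 3 and takes the half-width forms to be U+FF65 to U+FF9F. The names KATAKANA_ANY_WIDTH and has_katakana are made up for this example.

```python
import re

# Full-width katakana block (U+30A0..U+30FF) plus, as an assumption for
# this sketch, the half-width katakana forms at U+FF65..U+FF9F.
KATAKANA_ANY_WIDTH = re.compile("[\u30a0-\u30ff\uff65-\uff9f]")

def has_katakana(text):
    """Return True if text contains full-width or half-width katakana."""
    return KATAKANA_ANY_WIDTH.search(text) is not None
```

Calling has_katakana on a string then covers both widths in one search, at the cost of hard-coding a second range.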

edited Feb 8, 2011 at 11:29

Reiner Gerecke


answered Feb 8, 2011 at 11:23

dper
