How to split Unicode strings character by character in Python?

Question

My website supports a number of Indian languages, and the user can change the language dynamically. When a user enters a string, I have to split it into its individual characters, so I'm looking for a common function that works for English and a select set of Indian languages. I have searched across sites, but there appears to be no common way to handle this requirement. There are language-specific implementations (for example, the Open-Tamil package implements get_letters for Tamil), but I could not find a common way to split or iterate through the characters of a Unicode string that takes graphemes into consideration.

One of the many methods that I've tried:

name = u'தமிழ்'
print name
for i in list(name):
  print i

#expected output
தமிழ்
த
மி
ழ்

#actual output
தமிழ்
த
ம
ி
ழ
்

#Here is another example, using another Indian language
name = u'हिंदी'
print name
for i in list(name):
  print i

#expected output
हिंदी
हिं
दी

#actual output
हिंदी
ह
ि  
ं 
द
ी

jfs · Accepted Answer · 2015-10-12 09:37:35Z

12

To get "user-perceived" characters whatever the language, use the \X (eXtended grapheme cluster) regular expression, which the third-party regex module supports:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import regex # $ pip install regex

for text in [u'தமிழ்', u'हिंदी']:
    print("\n".join(regex.findall(r'\X', text, regex.U)))

Output

த
மி
ழ்
हिं
दी


Ignacio Vazquez-Abrams · Accepted Answer · 2015-10-11 18:55:03Z

8

The way to solve this is to group all "L" category characters with their subsequent "M" category characters:

>>> regex.findall(ur'\p{L}\p{M}*', name)
[u'\u0ba4', u'\u0bae\u0bbf', u'\u0bb4\u0bcd']
>>> for c in regex.findall(ur'\p{L}\p{M}*', name):
...   print c
... 
த
மி
ழ்

Here regex is the third-party regex module, not the standard library's re.
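To see which code points fall into the "L" and "M" categories, the standard library's unicodedata module can list them. This is only an illustrative sketch: the helper name describe is hypothetical, and the string is the Tamil word from the question written with escapes so the output stays ASCII.

```python
import unicodedata

def describe(text):
    """Return (code point, general category, name) for each code point in text."""
    return [
        ('U+%04X' % ord(ch), unicodedata.category(ch), unicodedata.name(ch, '<unnamed>'))
        for ch in text
    ]

# The Tamil word from the question, one escape per code point.
name = u'\u0ba4\u0bae\u0bbf\u0bb4\u0bcd'

for row in describe(name):
    print(row)
```

The categories come out as Lo, Lo, Mc, Lo, Mn: each letter (L) is followed by zero or more marks (M), which is exactly what the pattern groups together.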


6 Comments

user1928896 Over a year ago

Hi, did you mean 'regex' or 're'? I tried 're.findall(ur'\p{L}\p{M}*', name)' and it returned an empty list.

Ignacio Vazquez-Abrams Over a year ago

I meant "regex". Which is why I wrote "regex". And included a link to regex.

user1928896 Over a year ago

As it turns out, I cannot use the regex module in my App Engine application, since regex is not pure Python but includes a C extension. Is there an alternative solution to this problem using Python's re module, or some other means of achieving this?

Ignacio Vazquez-Abrams Over a year ago

You'll have to use unicodedata.category() to get the category of each character in turn and group them accordingly.
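A minimal sketch of that suggestion, using only the standard library. The function name split_clusters is hypothetical, and the rule used here (attach every category-M character to the character before it) is an approximation, not the full extended grapheme cluster algorithm behind \X. Unlike the \p{L}\p{M}* pattern, it keeps non-letters such as digits and spaces as clusters of their own instead of dropping them.

```python
import unicodedata

def split_clusters(text):
    """Group each character with the combining marks that follow it.

    Assumption: a "mark" is any code point whose general category
    starts with 'M' (Mn, Mc or Me). This is a simplification of the
    Unicode grapheme cluster rules.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith('M'):
            # A combining mark extends the cluster started earlier.
            clusters[-1] += ch
        else:
            # Anything else (letter, digit, space, ...) starts a new cluster.
            clusters.append(ch)
    return clusters
```

With the strings from the question, split_clusters(u'தமிழ்') gives the three clusters த, மி, ழ் and split_clusters(u'हिंदी') gives हिं and दी. Sequences held together by characters that are not marks, such as emoji joined with a zero-width joiner, are not grouped by this sketch.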

tchrist Over a year ago

While this may work in this particular case, \X is the preferred mechanism for pulling out individual grapheme clusters.


Aidan Fitzpatrick · Accepted Answer · 2018-02-10 14:52:33Z

2

uniseg works really well for this, and the docs are OK. The other answer to this question works for international Unicode characters, but falls flat if users enter Emoji. The solution below will work:

>>> emoji = u'😀😃😄😁'
>>> from uniseg.graphemecluster import grapheme_clusters
>>> for c in list(grapheme_clusters(emoji)):
...     print c
...
😀
😃
😄
😁

This is from pip install uniseg==0.7.1.
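A quick standard-library check suggests why a pattern such as \p{L}\p{M}* finds nothing in this string: these emoji are in general category So (Symbol, other), not L. This is only a sketch and assumes Python 3, where iterating over a str yields whole code points.

```python
import unicodedata

# The same four emoji as above, written as escapes.
emoji = u'\U0001F600\U0001F603\U0001F604\U0001F601'

# General category of each code point; none of them is a letter (L*).
categories = [unicodedata.category(ch) for ch in emoji]
print(categories)
```

Because none of the categories starts with L, a letter-plus-marks pattern has nothing to anchor on, whereas a grapheme cluster segmenter treats each emoji as its own cluster.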

edited Feb 10, 2018 at 14:52

answered Mar 11, 2017 at 17:35

Aidan Fitzpatrick


1 Comment

Clemens Tolboom Over a year ago

I tested your emoticons using regex 2022.3.15, which works fine using extended grapheme clusters (\X). regex has evolved, I guess.
