Python: Split unicode string on word boundaries

Question

I need to take a string, and shorten it to 140 characters.

Currently I am doing:

if len(tweet) > 140:
    tweet = re.sub(r"\s+", " ", tweet) #normalize space
    footer = "… " + utils.shorten_urls(post['url'])
    avail = 140 - len(footer)
    words = tweet.split()
    result = ""
    for word in words:
        word += " "
        if len(word) > avail:
            break
        result += word
        avail -= len(word)
    tweet = (result + footer).strip()
    assert len(tweet) <= 140

This works well for English and English-like strings, but fails for a Chinese string because tweet.split() returns a list with a single element:

>>> s = u"简讯：新華社報道，美國總統奧巴馬乘坐的「空軍一號」專機晚上10時42分進入上海空域，預計約30分鐘後抵達浦東國際機場，開展他上任後首次訪華之旅。"
>>> s
u'\u7b80\u8baf\uff1a\u65b0\u83ef\u793e\u5831\u9053\uff0c\u7f8e\u570b\u7e3d\u7d71\u5967\u5df4\u99ac\u4e58\u5750\u7684\u300c\u7a7a\u8ecd\u4e00\u865f\u300d\u5c08\u6a5f\u665a\u4e0a10\u664242\u5206\u9032\u5165\u4e0a\u6d77\u7a7a\u57df\uff0c\u9810\u8a08\u7d0430\u5206\u9418\u5f8c\u62b5\u9054\u6d66\u6771\u570b\u969b\u6a5f\u5834\uff0c\u958b\u5c55\u4ed6\u4e0a\u4efb\u5f8c\u9996\u6b21\u8a2a\u83ef\u4e4b\u65c5\u3002'
>>> s.split()
[u'\u7b80\u8baf\uff1a\u65b0\u83ef\u793e\u5831\u9053\uff0c\u7f8e\u570b\u7e3d\u7d71\u5967\u5df4\u99ac\u4e58\u5750\u7684\u300c\u7a7a\u8ecd\u4e00\u865f\u300d\u5c08\u6a5f\u665a\u4e0a10\u664242\u5206\u9032\u5165\u4e0a\u6d77\u7a7a\u57df\uff0c\u9810\u8a08\u7d0430\u5206\u9418\u5f8c\u62b5\u9054\u6d66\u6771\u570b\u969b\u6a5f\u5834\uff0c\u958b\u5c55\u4ed6\u4e0a\u4efb\u5f8c\u9996\u6b21\u8a2a\u83ef\u4e4b\u65c5\u3002']

How should I do this so it handles I18N? Does this make sense in all languages?

I'm on python 2.5.4 if that matters.

Mark Byers · Accepted Answer · 2009-11-15 20:57:22Z

8

Chinese doesn't usually have whitespace between words, and the symbols can have different meanings depending on context. You will have to understand the text in order to split it at a word boundary. In other words, what you are trying to do is not easy in general.

answered Nov 15, 2009 at 20:57

Mark Byers


6 Comments

Paul Tarjan Over a year ago

Does it make sense to substring a Chinese string? Like if I do s[:120] will that still be readable?

Mark Byers Over a year ago

You may end up with half a word which could totally change the meaning. Imagine splitting "assist" at the first three letters.

Paul Tarjan Over a year ago

OK, thank you. Does "..." mean the same thing in other languages, or is there an alternative "ellipsis" character?

Mark Byers Over a year ago

I'm not sure what character to use, but Wikipedia says something on the matter: en.wikipedia.org/wiki/Ellipsis#In_Chinese

John Machin Over a year ago

As far as I know, there is no special CJK ellipsis character. CJK characters are twice as wide ("full width") as Latin characters ("half width"), so it's probably better to use TWO ellipsis characters just as the Wikipedia article says: "In Chinese and sometimes in Japanese, ellipsis characters are done by entering two consecutive horizontal ellipsis (U+2026)." All of this presupposes that you have determined that the language in question is in fact Chinese, and not Japanese or Korean which also use the CJK characters and may well have different ellipsis conventions and ass/ist problems.


Alex Martelli · Accepted Answer · 2009-11-15 21:05:37Z

5

For word segmentation in Chinese, and other advanced tasks in processing natural language, consider NLTK as a good starting point if not a complete solution -- it's a rich Python-based toolkit, particularly good for learning about NL processing techniques (and not rarely good enough to offer you a viable solution to some of these problems).

answered Nov 15, 2009 at 21:05

Alex Martelli


3 Comments

Laurence Gonsalves Over a year ago

"not rarely" == usually, sometimes, something else?

Alex Martelli Over a year ago

@Laurence, depends on how bleeding-edge your typical NL tasks are, and how production-hardened and performance-tuned you need your code to be. If you're dealing with terabytes of text or need low-latency response, so you must deploy on a large, highly scalable parallel cluster, NLTK will at best let you sketch a prototype, not offer a viable solution for your requirements; for lower-volume and more time-tolerant tasks, esp. well-known ones such as segmentation, "usually" applies -- but there are all kinds of intermediate needs and special problem quirks!-)

Paul Tarjan Over a year ago

I really don't want to train an NLP solution for word break discovery. I'm sure someone did this already, and just want a pre-boxed wordbreak splitter.

ʞɔıu · Accepted Answer · 2009-11-16 22:43:39Z

3

The re.U flag makes \s match according to the Unicode character properties database.

The given string, however, apparently doesn't contain any whitespace characters according to Python's Unicode database:

>>> x = u'\u7b80\u8baf\uff1a\u65b0\u83ef\u793e\u5831\u9053\uff0c\u7f8e\u570b\u7e3d\u7d71\u5967\u5df4\u99ac\u4e58\u5750\u7684\u300c\u7a7a\u8ecd\u4e00\u865f\u300d\u5c08\u6a5f\u665a\u4e0a10\u664242\u5206\u9032\u5165\u4e0a\u6d77\u7a7a\u57df\uff0c\u9810\u8a08\u7d0430\u5206\u9418\u5f8c\u62b5\u9054\u6d66\u6771\u570b\u969b\u6a5f\u5834\uff0c\u958b\u5c55\u4ed6\u4e0a\u4efb\u5f8c\u9996\u6b21\u8a2a\u83ef\u4e4b\u65c5\u3002'
>>> re.compile(r'\s+', re.U).split(x)
[u'\u7b80\u8baf\uff1a\u65b0\u83ef\u793e\u5831\u9053\uff0c\u7f8e\u570b\u7e3d\u7d71\u5967\u5df4\u99ac\u4e58\u5750\u7684\u300c\u7a7a\u8ecd\u4e00\u865f\u300d\u5c08\u6a5f\u665a\u4e0a10\u664242\u5206\u9032\u5165\u4e0a\u6d77\u7a7a\u57df\uff0c\u9810\u8a08\u7d0430\u5206\u9418\u5f8c\u62b5\u9054\u6d66\u6771\u570b\u969b\u6a5f\u5834\uff0c\u958b\u5c55\u4ed6\u4e0a\u4efb\u5f8c\u9996\u6b21\u8a2a\u83ef\u4e4b\u65c5\u3002']
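For contrast, here is a small sketch of what Unicode-aware \s does match. It is written for Python 3 (an assumption; the question uses Python 2.5), where str patterns are Unicode-aware by default, so re.U is implied. The two sample strings are made up for illustration.

```python
import re

# U+3000 IDEOGRAPHIC SPACE counts as whitespace in the Unicode database.
with_space = "新華社\u3000報道"
without_space = "新華社報道"

print(re.split(r"\s+", with_space))     # splits at the ideographic space
print(re.split(r"\s+", without_space))  # nothing to split on, one piece
```

So the flag helps only when the text actually contains some kind of whitespace; it cannot find word boundaries that are not marked in the text.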

answered Nov 16, 2009 at 22:43

ʞɔıu


1 Comment

Paul Tarjan Over a year ago

Right, but "whitespace" in English means word separators, whereas there are no word separators in Chinese, only whitespace as sentence separators.

unutbu · Accepted Answer · 2010-01-21 03:31:24Z

2

I tried out the solution with PyAPNS for push notifications and just wanted to share what worked for me. The issue I had is that truncating at 256 bytes in UTF-8 would result in the notification getting dropped. I had to make sure the notification was encoded as "unicode_escape" to get it to work. I'm assuming this is because the result is sent as JSON and not raw UTF-8. Anyway, here is the function that worked for me:

def unicode_truncate(s, length, encoding='unicode_escape'):
    encoded = s.encode(encoding)[:length]
    return encoded.decode(encoding, 'ignore')
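The guess above is that the payload travels as JSON. Under that assumption (not confirmed here), one way to reason about the size is to measure the JSON-escaped form directly. This is a minimal Python 3 sketch; json_escaped_len and truncate_for_json are hypothetical helper names, and the code relies on json.dumps with ensure_ascii=True writing non-ASCII characters as \uXXXX escapes.

```python
import json

def json_escaped_len(s):
    """Length of s as an ASCII-only JSON string, excluding the two quotes."""
    return len(json.dumps(s, ensure_ascii=True)) - 2

def truncate_for_json(s, max_bytes):
    """Drop whole characters from the end until the escaped form fits."""
    while s and json_escaped_len(s) > max_bytes:
        s = s[:-1]
    return s

print(json_escaped_len("简讯"))  # each of these characters becomes a 6-byte escape
print(truncate_for_json("简讯abc", 13))
```

Dropping whole characters avoids cutting an escape sequence in half, at the cost of re-measuring the string on every step, which is fine for short notifications.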

edited Jan 21, 2010 at 3:31

unutbu


answered Jan 21, 2010 at 3:19

gigq



Paul Tarjan · Accepted Answer · 2009-11-16 23:40:00Z

1

After speaking with some native Cantonese, Mandarin, and Japanese speakers it seems that the correct thing to do is hard, but my current algorithm still makes sense to them in the context of internet posts.

Meaning, they are used to the "split on space and add … at the end" treatment.

So I'm going to be lazy and stick with it, until I get complaints from people that don't understand it.

The only change to my original implementation would be to not force a space after the last word, since it is unneeded in any language, and to use the Unicode character … (U+2026, &#x2026; in HTML) instead of three dots to save 2 characters.
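A minimal sketch of that revised approach, written for Python 3. It assumes the URL has already been shortened and is passed in as url (the original calls utils.shorten_urls, which is not shown here); the function name shorten and the variable names are illustrative.

```python
import re

ELLIPSIS = "\u2026"  # single-character horizontal ellipsis

def shorten(tweet, url, limit=140):
    """Trim tweet at a whitespace boundary so that text plus footer fits."""
    tweet = re.sub(r"\s+", " ", tweet).strip()  # normalize whitespace
    if len(tweet) <= limit:
        return tweet
    footer = ELLIPSIS + " " + url
    avail = limit - len(footer)
    kept = []
    used = 0
    for word in tweet.split(" "):
        # A separating space is needed between words, not after the last one.
        cost = len(word) + (1 if kept else 0)
        if used + cost > avail:
            break
        kept.append(word)
        used += cost
    return " ".join(kept) + footer

print(shorten("word " * 40, "http://x.co/a"))
```

Text with no whitespace at all, like the Chinese example in the question, still comes out as the footer alone; this sketch does not address that case.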

edited Nov 16, 2009 at 23:40

answered Nov 16, 2009 at 22:33

Paul Tarjan


1 Comment

ephemient Over a year ago

It's a named entity in HTML: &hellip;, horizontal ellipsis.

Noah · Accepted Answer · 2012-02-03 06:24:07Z

1

Basically, in CJK (except Korean, which uses spaces), you need dictionary look-ups to segment words properly. Depending on your exact definition of "word", Japanese can be more difficult than that, since not all inflected variants of a word (e.g. "行こう" vs. "行った") will appear in the dictionary. Whether it's worth the effort depends upon your application.
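To make "dictionary look-ups" concrete, here is a toy sketch of greedy forward maximum matching, one simple dictionary-based technique. It is only an illustration: max_match and toy_dict are made-up names, the dictionary holds a handful of hand-picked entries, and real segmenters use far larger dictionaries and statistical models.

```python
def max_match(text, dictionary, max_len=4):
    """At each position take the longest dictionary entry, else one character."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens

# Hypothetical toy dictionary, just enough for the sample phrase.
toy_dict = {"美國", "總統", "奧巴馬", "上海", "空域"}
print(max_match("美國總統奧巴馬", toy_dict))
```

Characters that are not covered by any entry fall back to single-character tokens, which is where an incomplete dictionary (or an inflected form) shows up as a bad split.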

answered Feb 3, 2012 at 6:24

Noah



thyu · Accepted Answer · 2020-10-15 13:51:27Z

What you're looking for is Chinese word segmentation tools. Word segmentation is not an easy task and is currently not perfectly solved. There are several tools:

CkipTagger: Developed by Academia Sinica, Taiwan.

jieba: Developed by Sun Junyi, a Baidu engineer.

pkuseg: Developed by Language Computing and Machine Learning Group, Peking University.

If what you want is character segmentation, it can be done, although it is not very useful.

>>> s = u"简讯：新華社報道，美國總統奧巴馬乘坐的「空軍一號」專機晚上10時42分進入上海空域，預計約30分鐘後抵達浦東國際機場，開展他上任後首次訪華之旅。"
>>> chars = list(s)
>>> chars
[u'\u7b80', u'\u8baf', u'\uff1a', u'\u65b0', u'\u83ef', u'\u793e', u'\u5831', u'\u9053', u'\uff0c', u'\u7f8e', u'\u570b', u'\u7e3d', u'\u7d71', u'\u5967', u'\u5df4', u'\u99ac', u'\u4e58', u'\u5750', u'\u7684', u'\u300c', u'\u7a7a', u'\u8ecd', u'\u4e00', u'\u865f', u'\u300d', u'\u5c08', u'\u6a5f', u'\u665a', u'\u4e0a', u'1', u'0', u'\u6642', u'4', u'2', u'\u5206', u'\u9032', u'\u5165', u'\u4e0a', u'\u6d77', u'\u7a7a', u'\u57df', u'\uff0c', u'\u9810', u'\u8a08', u'\u7d04', u'3', u'0', u'\u5206', u'\u9418', u'\u5f8c', u'\u62b5', u'\u9054', u'\u6d66', u'\u6771', u'\u570b', u'\u969b', u'\u6a5f', u'\u5834', u'\uff0c', u'\u958b', u'\u5c55', u'\u4ed6', u'\u4e0a', u'\u4efb', u'\u5f8c', u'\u9996', u'\u6b21', u'\u8a2a', u'\u83ef', u'\u4e4b', u'\u65c5', u'\u3002']
>>> print('/'.join(chars))
简/讯/：/新/華/社/報/道/，/美/國/總/統/奧/巴/馬/乘/坐/的/「/空/軍/一/號/」/專/機/晚/上/1/0/時/4/2/分/進/入/上/海/空/域/，/預/計/約/3/0/分/鐘/後/抵/達/浦/東/國/際/機/場/，/開/展/他/上/任/後/首/次/訪/華/之/旅/。
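Once some tool has produced word segments, shortening reduces to keeping whole segments until the limit is reached. The sketch below (Python 3) takes the segments as a plain list so it does not depend on any particular library; truncate_segments is a hypothetical helper, and the sample segmentation is written by hand rather than produced by a tool. With jieba installed, jieba.lcut(s) could supply the list instead.

```python
def truncate_segments(segments, limit, footer=""):
    """Join segments in order, cutting only between segments."""
    segments = list(segments)
    if sum(len(seg) for seg in segments) <= limit:
        return "".join(segments)  # everything fits, no footer needed
    avail = limit - len(footer)
    out = []
    used = 0
    for seg in segments:
        if used + len(seg) > avail:
            break
        out.append(seg)
        used += len(seg)
    return "".join(out) + footer

# Hand-written segmentation, for illustration only.
segs = ["美國", "總統", "奧巴馬", "乘坐", "的", "專機"]
print(truncate_segments(segs, 8, "\u2026\u2026"))
```

The footer here is two U+2026 characters, following the convention for Chinese mentioned in the comments on the first answer.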

score 0 · Accepted Answer · 2009-11-15 22:03:08Z

0

This punts the word-breaking decision to the re module, but it may work well enough for you.

import re

def shorten(tweet, footer="", limit=140):
    """Break tweet into two pieces at roughly the last word break
    before limit.
    """
    # limit under which to assume breaking didn't work as expected
    lower_break_limit = limit // 2

    limit -= len(footer)

    tweet = re.sub(r"\s+", " ", tweet.strip())
    m = re.match(r"^(.{,%d})\b(?:\W|$)" % limit, tweet, re.UNICODE)
    if not m or m.end(1) < lower_break_limit:
        # no suitable word break found
        # cutting at an arbitrary location,
        # or if len(tweet) < lower_break_limit, this will be true and
        # returning this still gives the desired result
        return tweet[:limit] + footer
    return m.group(1) + footer

edited Nov 15, 2009 at 22:03

answered Nov 15, 2009 at 21:27

Roger Pate

2 Comments

Paul Tarjan Over a year ago

Thanks. I added a check for the case where there are no word boundaries. For English strings this is working great, but for my Chinese example (doubled to make it long enough) I end up with a string that is 137 chars long, not 140: len(shorten(s*2, "... end"))

Roger Pate Over a year ago

That means it's working as expected, as it breaks at the last \b\W. However, I don't know Chinese to know if this is actually a word break in that text. Try shorten("abcde " * 3, "", 13) for another example of how it breaks shorter than the limit.

a paid nerd · Accepted Answer · 2009-11-16 22:49:44Z

-1

Save two characters and use an ellipsis (…, U+2026) instead of three dots!

answered Nov 16, 2009 at 22:49

a paid nerd


3 Comments

Adam Byrtek Over a year ago

In UTF-8 ellipsis takes 3 bytes so not much to be saved there :)

a paid nerd Over a year ago

I used the word "characters" instead of "bytes" on purpose. :)

John Machin Over a year ago

Adam meant: You save two Unicode characters, but in UTF-8, U+2026 takes 3 bytes, and three dots take 1 byte each so there's no saving when you store it. My note: Conceptually it's better to use an ellipsis character.
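The character count versus byte count distinction in these comments can be checked directly. A short Python 3 sketch (the variable names are illustrative):

```python
ellipsis = "\u2026"  # one character: HORIZONTAL ELLIPSIS
dots = "..."         # three ASCII full stops

# Counted in characters, the ellipsis is shorter.
print(len(ellipsis), len(dots))
# Counted in UTF-8 bytes, both take the same space.
print(len(ellipsis.encode("utf-8")), len(dots.encode("utf-8")))
```

Which count matters depends on whether the limit being enforced is expressed in characters or in bytes.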
