I want to segment a list of strings based on a custom list of words/forms, from right to left. Here I've illustrated it with made-up English examples, but the point is to segment/tokenize using a custom list of forms. (I was also wondering whether I could/should use NLTK tokenize for this, but it looked like that might be too complex for this use case.)
What would be a good Pythonic way to do this (readable and efficient)? I managed to get something working, but I don't have much experience and I'm curious how this can be improved. The code below uses a recursive function to keep splitting the left element where possible, and a $ anchor in the regular expression to match the morphs only in string-final position.
import re

phrases = ['greenhouse', 'flowering plants', 'ecosystemcollapse']
morphs = ['house', 'flower', 'system', 'collapse']

def segment(text):
    # Split off a morph only in string-final position ($ anchor); the capturing
    # group keeps the morph itself, and `if match` drops empty strings.
    segmented_text = [match for match in
                      re.split('(' + '|'.join(morphs) + ')$', text) if match]
    if len(segmented_text) == 2:
        # A morph was peeled off the end: record it, then keep
        # segmenting the remaining left part.
        segmented_morphs.insert(0, segmented_text[1])
        segment(segmented_text[0])
    else:
        # Nothing (more) to split off: keep the remainder whole.
        segmented_morphs.insert(0, segmented_text[0])

for phrase in phrases:
    segmented_morphs = []  # global list that segment() prepends to
    words = phrase.split()
    for word in reversed(words):
        segment(word)
    print(segmented_morphs)
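For reference, this is why the `if match` filter is needed: re.split keeps the captured morph in the result and emits an empty string for the (empty) remainder after a string-final match.

>>> import re
>>> re.split('(house)$', 'greenhouse')
['green', 'house', '']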
Result (as desired):
['green', 'house']
['flowering', 'plants'] # here 'flower' should not be segmented
['eco', 'system', 'collapse']
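For comparison, here is a minimal non-recursive sketch of the same right-to-left peeling, without regular expressions. It assumes Python 3.9+ for str.removesuffix, and segment_word is an illustrative name, not part of the code above:

def segment_word(word, morphs):
    # Peel known morphs off the right end, longest first, but never
    # split a word that is itself exactly one of the morphs.
    parts = []
    while True:
        suffix = next((m for m in sorted(morphs, key=len, reverse=True)
                       if word.endswith(m) and word != m), None)
        if suffix is None:
            break
        parts.insert(0, suffix)
        word = word.removesuffix(suffix)
    parts.insert(0, word)
    return parts

for phrase in phrases:
    print([part for w in phrase.split() for part in segment_word(w, morphs)])

Trying the longest morph first mirrors what the anchored regex does implicitly: re.split takes the leftmost match, and the leftmost string-final match is the longest matching suffix.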