I want to segment a list of strings based on a custom list of words/forms, from right to left. Here I've illustrated it with made-up English examples, but the point is to segment/tokenize using a custom list of forms. (I was also wondering whether I could/should use NLTK tokenize for this, but it looked like that might be too complex for this use case.)
What would be a good Pythonic way to do this (readable and efficient)? I managed to get something working, but I don't have much experience and I'm curious how this can be improved. The code below uses a recursive function to keep splitting the left element where possible, and a $ anchor in the regular expression to match the morphs only in string-final position.
import re

phrases = ['greenhouse', 'flowering plants', 'ecosystemcollapse']
morphs = ['house', 'flower', 'system', 'collapse']

def segment(text):
    # Split off a morph only in string-final position ($ anchor); the capturing
    # group keeps the morph itself, and `if match` drops empty strings.
    segmented_text = [match for match in
                      re.split('(' + '|'.join(morphs) + ')$', text) if match]
    if len(segmented_text) == 2:
        # A morph was peeled off the end: record it, then keep
        # segmenting the remaining left part.
        segmented_morphs.insert(0, segmented_text[1])
        segment(segmented_text[0])
    else:
        # Nothing (more) to split off: keep the remainder whole.
        segmented_morphs.insert(0, segmented_text[0])

for phrase in phrases:
    segmented_morphs = []  # global list that segment() prepends to
    words = phrase.split()
    for word in reversed(words):
        segment(word)
    print(segmented_morphs)
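For reference, this is why the `if match` filter is needed: re.split keeps the captured morph in the result and emits an empty string for the (empty) remainder after a string-final match.

>>> import re
>>> re.split('(house)$', 'greenhouse')
['green', 'house', '']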
Result (as desired):
['green', 'house']
['flowering', 'plants'] # here 'flower' should not be segmented
['eco', 'system', 'collapse']
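For comparison, here is a minimal non-recursive sketch of the same right-to-left peeling, without regular expressions. It assumes Python 3.9+ for str.removesuffix, and segment_word is an illustrative name, not part of the code above:

def segment_word(word, morphs):
    # Peel known morphs off the right end, longest first, but never
    # split a word that is itself exactly one of the morphs.
    parts = []
    while True:
        suffix = next((m for m in sorted(morphs, key=len, reverse=True)
                       if word.endswith(m) and word != m), None)
        if suffix is None:
            break
        parts.insert(0, suffix)
        word = word.removesuffix(suffix)
    parts.insert(0, word)
    return parts

for phrase in phrases:
    print([part for w in phrase.split() for part in segment_word(w, morphs)])

Trying the longest morph first mirrors what the anchored regex does implicitly: re.split takes the leftmost match, and the leftmost string-final match is the longest matching suffix.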