How to split a string into a list of predefined substrings of different lengths?

Question

Given a collection of predefined strings of unequal length and an input string, I want to split the input into occurrences of elements from the collection. The output should be unique for every input, and the split should prefer the longest possible chunks.

For example, s, c, and h should each become a separate chunk unless they are adjacent.

If "sc" appears together, it should be grouped as 'sc' rather than 's', 'c'. Similarly, "sh" must be grouped as 'sh', "ch" as 'ch', and "sch" as 'sch'.

I only know that string.split(delim) splits on a specified delimiter and that re.split('\w{n}', string) splits a string into chunks of equal length. Neither method gives the intended result. How can this be done?

Pseudo code:

def phonemic_splitter(string):
    phonemes = ['a', 'sh', 's', 'g', 'n', 'c', 'e', 'ch', 'sch']
    output = do_something(string)
    return output

And example outputs:

phonemic_splitter('case') -> ['c', 'a', 's', 'e']
phonemic_splitter('ash') -> ['a', 'sh']
phonemic_splitter('change') -> ['ch', 'a', 'n', 'g', 'e']
phonemic_splitter('schane') -> ['sch', 'a', 'n', 'e']

Can you please explain what the required input and output are? The question is confusing.
– Divek John, Commented Aug 25, 2021 at 13:50

Riccardo Bucco · Accepted Answer · 2021-08-25 14:21:32Z

2

Here is a possible solution:

def phonemic_splitter(s, phonemes):
    phonemes = sorted(phonemes, key=len, reverse=True)
    result = []
    while s:
        result.append(next(filter(s.startswith, phonemes)))
        s = s[len(result[-1]):]
    return result

This solution assumes that phonemes lists every phoneme that can occur in the string s. If the remaining part of s starts with none of them, next raises StopIteration.
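To make that caveat concrete, here is a minimal sketch of the same greedy loop that fails with an explicit error when no phoneme matches. The name phonemic_splitter_checked and the choice of ValueError are assumptions made for illustration; they are not part of the answer above.

```python
def phonemic_splitter_checked(s, phonemes):
    # Hypothetical variant: try longer phonemes first so 'sch' beats 's'.
    ordered = sorted(phonemes, key=len, reverse=True)
    result = []
    pos = 0
    while pos < len(s):
        # next() with a default returns None instead of raising StopIteration.
        # The "if p" guard skips empty strings, which would loop forever.
        match = next((p for p in ordered if p and s.startswith(p, pos)), None)
        if match is None:
            raise ValueError(f"no phoneme matches {s[pos:]!r} at index {pos}")
        result.append(match)
        pos += len(match)
    return result


phonemes = ['a', 'sh', 's', 'g', 'n', 'c', 'e', 'ch', 'sch']
print(phonemic_splitter_checked('schane', phonemes))  # ['sch', 'a', 'n', 'e']
```

Tracking an index instead of re-slicing s also avoids copying the remaining string on every iteration.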

One could also speed up this solution by implementing a binary search to be used in place of next.
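As one possible way to reduce the work per position, the sketch below replaces the linear scan with set lookups, one per distinct phoneme length. This is not the binary search mentioned above but a different, simpler technique, and the function name phonemic_splitter_by_length is hypothetical.

```python
def phonemic_splitter_by_length(s, phonemes):
    # Set membership is O(1) on average, so each position costs at most
    # one lookup per distinct phoneme length.
    known = set(phonemes)
    lengths = sorted({len(p) for p in known if p}, reverse=True)
    result = []
    pos = 0
    while pos < len(s):
        for n in lengths:  # longest candidate first
            chunk = s[pos:pos + n]
            if chunk in known:
                result.append(chunk)
                pos += len(chunk)
                break
        else:
            # No length produced a known phoneme at this position.
            raise ValueError(f"no phoneme matches {s[pos:]!r} at index {pos}")
    return result


phonemes = ['a', 'sh', 's', 'g', 'n', 'c', 'e', 'ch', 'sch']
print(phonemic_splitter_by_length('change', phonemes))  # ['ch', 'a', 'n', 'g', 'e']
```

Whether this is faster in practice depends on how many phonemes there are; for a list as short as the one in the question, the difference is unlikely to matter.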

edited Aug 25, 2021 at 14:21

answered Aug 25, 2021 at 14:13

Riccardo Bucco


dawg · Accepted Answer · 2021-08-25 14:45:37Z

0

You could use a regex:

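import re

cases = ['case', 'ash', 'change', 'schane']

for e in cases:
    # Multi-character alternatives come first, longest first, because
    # regex alternation takes the first branch that matches.
    print(repr(e), '->', re.findall(r'sch|sh|ch|[a-z]', e))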

Prints:

'case' -> ['c', 'a', 's', 'e']
'ash' -> ['a', 'sh']
'change' -> ['ch', 'a', 'n', 'g', 'e']
'schane' -> ['sch', 'a', 'n', 'e']

You could incorporate this into your function as follows:

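import re

def do_something(s, splits):
    # Multi-character phonemes go first, longest first, because regex
    # alternation takes the first branch that matches.
    multi = sorted((x for x in splits if len(x) > 1), key=len, reverse=True)
    # re.escape guards against phonemes containing regex metacharacters.
    # Note: the [a-z] fallback matches any lowercase letter, listed or not.
    pat = '|'.join(re.escape(x) for x in multi) + '|[a-z]'
    return re.findall(pat, s)

def phonemic_splitter(string):
    phonemes = ['a', 'sh', 's', 'g', 'n', 'c', 'e', 'ch', 'sch']
    output = do_something(string, phonemes)
    return output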
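The [a-z] fallback in the pattern above accepts any lowercase letter, including letters that are not in the phoneme list, and re.findall silently skips anything it cannot match. A minimal sketch of a stricter variant, which builds the pattern from the phoneme list alone and checks that the whole input was consumed, might look like this. The name regex_splitter and the ValueError are assumptions for illustration.

```python
import re


def regex_splitter(s, phonemes):
    # Longest alternatives first: Python's regex alternation takes the
    # first branch that matches, not the longest one.
    ordered = sorted((p for p in phonemes if p), key=len, reverse=True)
    pattern = re.compile('|'.join(re.escape(p) for p in ordered))
    tokens = pattern.findall(s)
    # findall skips text it cannot match, so verify that the tokens
    # reassemble into the original string.
    if ''.join(tokens) != s:
        raise ValueError(f"{s!r} contains text outside the phoneme list")
    return tokens


phonemes = ['a', 'sh', 's', 'g', 'n', 'c', 'e', 'ch', 'sch']
print(regex_splitter('ash', phonemes))  # ['a', 'sh']
```

With this list, an input such as 'cab' raises ValueError because 'b' is not a listed phoneme, whereas the [a-z] version would return it as its own chunk.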

answered Aug 25, 2021 at 14:45

dawg
