
Given a collection of predefined strings of unequal length and an input string, I want to split the input into occurrences of the elements in the collection. The output should be unique for every input, and it should prefer the longest possible chunks.

For example, it should split s, c and h into separate chunks unless they are adjacent.

If "sc" appears, it should be grouped as 'sc' and not as 's', 'c'. Similarly, "sh" must be grouped as 'sh', "ch" as 'ch', and finally "sch" as 'sch'.

I only know that string.split(delim) splits on a specified delimiter and that re.split('\w{n}', string) splits a string into chunks of equal length. Neither of these methods gives the intended result. How can this be done?

Pseudo code:

def phonemic_splitter(string):
    phonemes = ['a', 'sh', 's', 'g', 'n', 'c', 'e', 'ch', 'sch']
    output = do_something(string)
    return output

And example outputs:

phonemic_splitter('case') -> ['c', 'a', 's', 'e']
phonemic_splitter('ash') -> ['a', 'sh']
phonemic_splitter('change') -> ['ch', 'a', 'n', 'g', 'e']
phonemic_splitter('schane') -> ['sch', 'a', 'n', 'e']
  • Can you please explain what input and output are required? The question is confusing. Commented Aug 25, 2021 at 13:50

2 Answers

2

Here is a possible solution:

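def phonemic_splitter(s, phonemes):
    # Try longer phonemes first so that 'sch' wins over 's', 'sh' and 'ch'.
    phonemes = sorted(phonemes, key=len, reverse=True)
    result = []
    while s:
        # Take the first (that is, the longest) phoneme that s starts with.
        result.append(next(filter(s.startswith, phonemes)))
        # Drop the matched prefix and continue with the rest of the string.
        s = s[len(result[-1]):]
    return result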

This solution assumes that phonemes lists every phoneme that can occur in the string s; otherwise, next raises StopIteration as soon as no phoneme matches the remaining text.
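If the input may contain text that no phoneme covers, a variant along these lines could report the problem explicitly. This is only a sketch: the name phonemic_splitter_checked and the choice of ValueError are assumptions made for illustration, not part of the solution above.

```python
def phonemic_splitter_checked(s, phonemes):
    """Greedy longest-first split that fails loudly on unknown text."""
    # Ignore empty strings: an empty match would never consume any input.
    ordered = sorted((p for p in phonemes if p), key=len, reverse=True)
    result = []
    pos = 0
    while pos < len(s):
        # str.startswith accepts a start offset, so no slicing is needed here.
        match = next((p for p in ordered if s.startswith(p, pos)), None)
        if match is None:
            raise ValueError(f"no phoneme matches {s[pos:]!r} at index {pos}")
        result.append(match)
        pos += len(match)
    return result
```

Passing None as the default to next avoids the bare StopIteration, and tracking an index instead of re-slicing s avoids copying the remainder of the string on every step.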

One could also speed up this solution by using a binary search in place of the linear scan that next performs.
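As a different way to avoid scanning the whole list (a set lookup per candidate length rather than a binary search), something like the following might work. The function name phonemic_splitter_fast is hypothetical, and the sketch assumes the number of distinct phoneme lengths is small.

```python
def phonemic_splitter_fast(s, phonemes):
    """Greedy longest-match split using set membership tests."""
    known = {p for p in phonemes if p}  # drop empty strings
    # Distinct phoneme lengths, longest first, so longer chunks are preferred.
    lengths = sorted({len(p) for p in known}, reverse=True)
    result = []
    pos = 0
    while pos < len(s):
        for n in lengths:
            chunk = s[pos:pos + n]
            if chunk in known:
                result.append(chunk)
                pos += len(chunk)
                break
        else:
            # No candidate length produced a known phoneme at this position.
            raise ValueError(f"no phoneme matches {s[pos:]!r} at index {pos}")
    return result
```

Each position then costs one slice and one hash lookup per distinct length, regardless of how many phonemes there are.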

0

You could use a regex:

import re

cases = ['case', 'ash', 'change', 'schane']

# Alternatives are tried left to right, so the longer ones must come first.
for e in cases:
    print(repr(e), '->', re.findall(r'sch|sh|ch|[a-z]', e))

Prints:

'case' -> ['c', 'a', 's', 'e']
'ash' -> ['a', 'sh']
'change' -> ['ch', 'a', 'n', 'g', 'e']
'schane' -> ['sch', 'a', 'n', 'e']

You could incorporate this into your function as follows:

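import re

def do_something(s, splits):
    # Multi-character phonemes, longest first, so 'sch' is tried before 'sh' and 'ch'.
    multi = sorted((x for x in splits if len(x) > 1), key=len, reverse=True)
    # Escape each phoneme in case it contains regex metacharacters;
    # any remaining lowercase letter is matched on its own.
    pat = '|'.join([re.escape(x) for x in multi] + ['[a-z]'])
    return re.findall(pat, s)

def phonemic_splitter(string):
    phonemes = ['a', 'sh', 's', 'g', 'n', 'c', 'e', 'ch', 'sch']
    output = do_something(string, phonemes)
    return output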
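Note that the [a-z] fallback accepts any lowercase letter, including ones that are not in the phoneme list. If that matters, one option is to build the pattern from the phonemes alone and check that the matches cover the whole input. This is a sketch under that assumption; build_phoneme_pattern and phonemic_splitter_strict are hypothetical names.

```python
import re

def build_phoneme_pattern(phonemes):
    """Compile an alternation of all phonemes, longest first."""
    ordered = sorted((p for p in phonemes if p), key=len, reverse=True)
    return re.compile('|'.join(re.escape(p) for p in ordered))

def phonemic_splitter_strict(string, phonemes):
    chunks = build_phoneme_pattern(phonemes).findall(string)
    # findall silently skips text that matches nothing, so compare the
    # joined chunks against the input to detect anything that was skipped.
    if ''.join(chunks) != string:
        raise ValueError(f"{string!r} contains text outside the phoneme list")
    return chunks
```

With the phoneme list from the question, phonemic_splitter_strict('schane', phonemes) gives the same chunks as before, while an input such as 'cab' raises ValueError because 'b' is not a phoneme.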
