6

Google text-to-speech (TTS) has a 5000 character limit, while my text is about 50k characters. I need to chunk the string based on a given limit without cutting off the words.

“Well, Prince, so Genoa and Lucca are now just family estates of the Buonapartes. But I warn you, if you don’t tell me that this means war, if you still try to defend the infamies and horrors perpetrated by that Antichrist—I really believe he is Antichrist—I will have nothing more to do with you and you are no longer my friend, no longer my ‘faithful slave,’ as you call yourself! But how do you do? I see I have frightened you—sit down and tell me all the news.”

How do I chunk the string above into a list of strings that are each no more than 20 characters, without cutting off the words?

I looked at the NLTK library chunking section and didn't see anything there.

2 Comments
  • "without cutting off the words" - what does this mean? You mean you always want the splits to be in the white space between the words? Commented Jul 13, 2019 at 22:51
  • @Dan Yeah, that's right, because I have to feed it through Google's text-to-speech API. Commented Jul 13, 2019 at 23:22

5 Answers

7

This is a similar idea to Green Cloak Guy's, but uses a generator rather than creating a list. This should be a little more memory-friendly with large texts and will allow you to iterate over the chunks lazily. You can turn it into a list with list() or use it anywhere an iterator is expected:

s = "Well, Prince, so Genoa and Lucca are now just family estates of the Buonapartes. But I warn you, if you don’t tell me that this means war, if you still try to defend the infamies and horrors perpetrated by that Antichrist—I really believe he is Antichrist—I will have nothing more to do with you and you are no longer my friend, no longer my ‘faithful slave,’ as you call yourself! But how do you do? I see I have frightened you—sit down and tell me all the news."

def get_chunks(s, maxlength):
    start = 0
    end = 0
    while start + maxlength < len(s) and end != -1:
        end = s.rfind(" ", start, start + maxlength + 1)
        yield s[start:end]
        start = end + 1
    yield s[start:]

chunks = get_chunks(s, 25)

#Make list with line lengths:
[(n, len(n)) for n in chunks]

Results:

[('Well, Prince, so Genoa', 22),
 ('and Lucca are now just', 22),
 ('family estates of the', 21),
 ('Buonapartes. But I warn', 23),
 ('you, if you don’t tell me', 25),
 ('that this means war, if', 23),
 ('you still try to defend', 23),
 ('the infamies and horrors', 24),
 ('perpetrated by that', 19),
 ('Antichrist—I really', 19),
 ('believe he is', 13),
 ('Antichrist—I will have', 22),
 ('nothing more to do with', 23),
 ('you and you are no longer', 25),
 ('my friend, no longer my', 23),
 ('‘faithful slave,’ as you', 24),
 ('call yourself! But how do', 25),
 ('you do? I see I have', 20),
 ('frightened you—sit down', 23),
 ('and tell me all the news.', 25)]

5 Comments

If s = "aa aa aa aa aa aa aa" and you call get_chunks(s, 5) this will incorrectly return chunks of "aa " instead of "aa aa". Otherwise, it's a great solution.
That's an interesting test case @Dan, thanks. If it returned chunks of "aa aa" it would lose a space: the one between the chunks. It seems it should start with "aa aa" and then the next chunk would be " aa "...
My assumption is that there is a space inferred between the separate elements of the chunked list
@Dan Yes, that's probably a better way to do it.
I think that edit deals with it @Dan. I appreciate the feedback — and would be happy to hear of any edge cases.
6

A base-python approach would look 20 characters ahead, find the last bit of whitespace possible, and cut the line there. This isn't an incredibly elegant implementation of that, but it should do the job:

orig_string = "Well, Prince, so Genoa and Lucca are now just family estates of the Buonapartes. But I warn you, if you don’t tell me that this means war, if you still try to defend the infamies and horrors perpetrated by that Antichrist—I really believe he is Antichrist—I will have nothing more to do with you and you are no longer my friend, no longer my ‘faithful slave,’ as you call yourself! But how do you do? I see I have frightened you—sit down and tell me all the news."
list_of_lines = []
max_length = 20
while len(orig_string) > max_length:
    line_length = orig_string[:max_length].rfind(' ')
    list_of_lines.append(orig_string[:line_length])
    orig_string = orig_string[line_length + 1:]
list_of_lines.append(orig_string)

5 Comments

It seems to be leaving out the last bit... I've changed the max_length number and it seems to leave out the end no matter what.
Oh, sorry, I accidentally forgot to add the last line, in which you just add whatever's left of orig_string to list_of_lines.
If there is a space at character 20, wouldn't this cut it one word too short? I.e., before doing the rfind, you should check if orig_string[max_length] == ' '.
@Dan good point - though an easier way of handling that might be to just set max_length to 21.
Thanks for the correction. This works. At first I was cursing Google for making the limit so low; now I love it, because I can break the text up, multithread the requests, combine the results, and the conversion is really fast.
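
A hedged variant folding in that boundary check, searching one character past max_length so a space sitting exactly at the limit still counts as a split point (an illustrative sketch only; a single token longer than max_length would still need separate handling, just as in the answer's loop):

orig_string = "the quick brown fox jumps over the lazy dog"  # any input text
list_of_lines = []
max_length = 20
while len(orig_string) > max_length:
    # Include index max_length in the search window so a space at the boundary is found.
    line_length = orig_string[:max_length + 1].rfind(' ')
    list_of_lines.append(orig_string[:line_length])
    orig_string = orig_string[line_length + 1:]
list_of_lines.append(orig_string)
print(list_of_lines)  # ['the quick brown fox', 'jumps over the lazy', 'dog']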
2

This problem of chunking TTS inputs is more complicated than merely splitting over words, sentences, or even paragraphs. Depending on the language, and especially for English, modern neural TTS models benefit from having contextual information about the surrounding words, especially the preceding ones.

As such, instead of just splitting over words or sentences, a more correct way to split would be:

  1. First at paragraph boundaries.
  2. Then at sentence boundaries in case of paragraphs that are too large.
  3. Then at word boundaries in case of sentences that are too large.
  4. Lastly over characters in case of words that are too large.

After this, any consecutive splits that can be merged should be merged with appropriate separators. Overall, this keeps the chunks more intelligible for the TTS.
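
A rough, self-contained sketch of this tiered strategy (the split_recursive helper and its separator list are illustrative assumptions, not how the packages mentioned below implement it):

def split_recursive(text, limit, separators=("\n\n", "\n", " ", "")):
    """Split text into chunks of at most `limit` characters, preferring
    paragraph boundaries, then line boundaries, then spaces, and finally
    raw characters for single words that exceed the limit."""
    if len(text) <= limit:
        return [text] if text else []
    sep, *finer = separators
    if sep == "":
        # Last resort: hard-split an oversized word over characters.
        return [text[i:i + limit] for i in range(0, len(text), limit)]
    chunks, current = [], ""
    for piece in filter(None, text.split(sep)):
        candidate = f"{current}{sep}{piece}" if current else piece
        if len(candidate) <= limit:
            current = candidate  # merge consecutive pieces while they still fit
            continue
        if current:
            chunks.append(current)
        if len(piece) <= limit:
            current = piece  # start a fresh chunk with this piece
        else:
            # The piece is still too large at this level; recurse with finer separators.
            chunks.extend(split_recursive(piece, limit, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks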


Python packages like semantic-text-splitter and semchunk generalize this task of "semantic splitting". Here is a solution using semantic-text-splitter:

from semantic_text_splitter import TextSplitter

def semantic_split(text: str, limit: int) -> list[str]:
    """Return a list of chunks from the given text, splitting it at semantically sensible boundaries while applying the specified character length limit for each chunk."""
    # Ref: https://stackoverflow.com/a/78288960/
    splitter = TextSplitter(limit)
    chunks = splitter.chunks(text)
    return chunks

LIMIT = 50
TEXT = """“Well, Prince, so Genoa and Lucca are now just family estates of the Buonapartes. But I warn you, if you don’t tell me that this means war, if you still try to defend the infamies and horrors perpetrated by that Antichrist—I really believe he is Antichrist—I will have nothing more to do with you and you are no longer my friend, no longer my ‘faithful slave,’ as you call yourself! But how do you do? I see I have frightened you—sit down and tell me all the news.”"""
chunks = semantic_split(TEXT, LIMIT)

# Print chunks:
for num, chunk in enumerate(chunks, start=1):
    print({"#": num, "len": len(chunk), "chunk": chunk})

Output:

{'#': 1, 'len': 46, 'chunk': '“Well, Prince, so Genoa and Lucca are now just'}
{'#': 2, 'len': 34, 'chunk': 'family estates of the Buonapartes.'}
{'#': 3, 'len': 46, 'chunk': 'But I warn you, if you don’t tell me that this'}
{'#': 4, 'len': 50, 'chunk': 'means war, if you still try to defend the infamies'}
{'#': 5, 'len': 44, 'chunk': 'and horrors perpetrated by that Antichrist—I'}
{'#': 6, 'len': 43, 'chunk': 'really believe he is Antichrist—I will have'}
{'#': 7, 'len': 49, 'chunk': 'nothing more to do with you and you are no longer'}
{'#': 8, 'len': 48, 'chunk': 'my friend, no longer my ‘faithful slave,’ as you'}
{'#': 9, 'len': 33, 'chunk': 'call yourself! But how do you do?'}
{'#': 10, 'len': 48, 'chunk': 'I see I have frightened you—sit down and tell me'}
{'#': 11, 'len': 14, 'chunk': 'all the news.”'}


0

Building on Mark's answer: it looks like there's a small bug in the code when dealing with the end of the search; something like this might work:

def text_to_chunks(s, maxlength):
    start = 0
    end = 0
    while start + maxlength < len(s) and end != -1:
        end = s.rfind(" ", start, start + maxlength + 1)
        if end == -1:
            # No space found in the window; stop and yield the remainder below.
            break
        yield s[start:end]
        start = end + 1
    yield s[start:]
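
For example, with the "aa aa" test case raised in the comments on the accepted answer (an illustrative check, not part of the original answer):

s = "aa aa aa aa aa aa aa"
print(list(text_to_chunks(s, 5)))  # ['aa aa', 'aa aa', 'aa aa', 'aa']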


-2

You can use the nltk.tokenize methods as follows:

import nltk

corpus = '''
Well, Prince, so Genoa and Lucca are now just family estates of the Buonapartes. But I warn you, if you don’t tell me that this means war, if you still try to defend the infamies and horrors perpetrated by that Antichrist—I really believe he is Antichrist—I will have nothing more to do with you and you are no longer my friend, no longer my ‘faithful slave,’ as you call yourself! But how do you do? I see I have frightened you—sit down and tell me all the news.
'''

tokens = nltk.tokenize.word_tokenize(corpus)

or

sent_tokens = nltk.tokenize.sent_tokenize(corpus)

2 Comments

How does this create chunks of a certain size?
nltk.tokenize could be relevant as a low-level utility for building a splitter, but it doesn't by itself provide everything needed for the problem at hand.
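
For instance, sent_tokenize could feed a greedy packer that respects the character limit. A hypothetical sketch (the chunk_sentences name and the punkt download are assumptions, and sentences longer than the limit would still need a word-level fallback):

import nltk

nltk.download("punkt", quiet=True)  # sentence tokenizer models, needed once

def chunk_sentences(text, limit):
    """Greedily pack NLTK sentences into chunks of at most `limit` characters."""
    chunks, current = [], ""
    for sentence in nltk.tokenize.sent_tokenize(text):
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= limit:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks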
