6

Google text-to-speech (TTS) has a 5000 character limit, while my text is about 50k characters. I need to chunk the string based on a given limit without cutting off the words.

“Well, Prince, so Genoa and Lucca are now just family estates of the Buonapartes. But I warn you, if you don’t tell me that this means war, if you still try to defend the infamies and horrors perpetrated by that Antichrist—I really believe he is Antichrist—I will have nothing more to do with you and you are no longer my friend, no longer my ‘faithful slave,’ as you call yourself! But how do you do? I see I have frightened you—sit down and tell me all the news.”

How do I chunk the string above into a list of strings that are each no more than 20 characters, without cutting off the words?

I looked at the NLTK library chunking section and didn't see anything there.

2 Comments
  • "without cutting off the words" - what does this mean? You mean you always want the splits to be in the white space between the words? Commented Jul 13, 2019 at 22:51
  • @Dan Yeah, that's right, because I have to feed it through Google's text-to-speech API. Commented Jul 13, 2019 at 23:22

5 Answers

7

This is a similar idea to Green Cloak Guy's, but uses a generator rather than creating a list. This should be a little more memory-friendly with large texts and will allow you to iterate over the chunks lazily. You can turn it into a list with list() or use it anywhere an iterator is expected:

s = "Well, Prince, so Genoa and Lucca are now just family estates of the Buonapartes. But I warn you, if you don’t tell me that this means war, if you still try to defend the infamies and horrors perpetrated by that Antichrist—I really believe he is Antichrist—I will have nothing more to do with you and you are no longer my friend, no longer my ‘faithful slave,’ as you call yourself! But how do you do? I see I have frightened you—sit down and tell me all the news."

def get_chunks(s, maxlength):
    start = 0
    end = 0
    while start + maxlength < len(s) and end != -1:
        end = s.rfind(" ", start, start + maxlength + 1)
        yield s[start:end]
        start = end + 1
    yield s[start:]

chunks = get_chunks(s, 25)

#Make list with line lengths:
[(n, len(n)) for n in chunks]

Results:

[('Well, Prince, so Genoa', 22),
 ('and Lucca are now just', 22),
 ('family estates of the', 21),
 ('Buonapartes. But I warn', 23),
 ('you, if you don’t tell me', 25),
 ('that this means war, if', 23),
 ('you still try to defend', 23),
 ('the infamies and horrors', 24),
 ('perpetrated by that', 19),
 ('Antichrist—I really', 19),
 ('believe he is', 13),
 ('Antichrist—I will have', 22),
 ('nothing more to do with', 23),
 ('you and you are no longer', 25),
 ('my friend, no longer my', 23),
 ('‘faithful slave,’ as you', 24),
 ('call yourself! But how do', 25),
 ('you do? I see I have', 20),
 ('frightened you—sit down', 23),
 ('and tell me all the news.', 25)]

5 Comments

If s = "aa aa aa aa aa aa aa" and you call get_chunks(s, 5) this will incorrectly return chunks of "aa " instead of "aa aa". Otherwise, it's a great solution.
That's an interesting test case @Dan, thanks. If it returned chunks of "aa aa" it would lose a space: the one between the chunks. It seems it should start with "aa aa" and then the next chunk would be " aa "...
My assumption is that there is a space inferred between the separate elements of the chunked list
@Dan Yes, that's probably a better way to do it.
I think that edit deals with it @Dan. I appreciate the feedback — and would be happy to hear of any edge cases.
6

A base-python approach would look 20 characters ahead, find the last bit of whitespace possible, and cut the line there. This isn't an incredibly elegant implementation of that, but it should do the job:

orig_string = "Well, Prince, so Genoa and Lucca are now just family estates of the Buonapartes. But I warn you, if you don’t tell me that this means war, if you still try to defend the infamies and horrors perpetrated by that Antichrist—I really believe he is Antichrist—I will have nothing more to do with you and you are no longer my friend, no longer my ‘faithful slave,’ as you call yourself! But how do you do? I see I have frightened you—sit down and tell me all the news."
list_of_lines = []
max_length = 20
while len(orig_string) > max_length:
    line_length = orig_string[:max_length].rfind(' ')
    list_of_lines.append(orig_string[:line_length])
    orig_string = orig_string[line_length + 1:]
list_of_lines.append(orig_string)

5 Comments

It seems to be leaving out the last bit... I've changed the max_length number and it seems to leave out the end no matter what.
Oh, sorry, I accidentally forgot to add the last line, in which you just add whatever's left of orig_string to list_of_lines.
If there is a space at character 20, wouldn't this cut it one word too short? I.e., before doing the rfind, you should check if orig_string[max_length] == ' '.
@Dan good point - though an easier way of handling that might be to just set max_length to 21.
Thanks for the correction. This works. At first I was cursing Google for making the limit so low; now I love it, because I can break the text up, multithread the requests, combine the results, and the conversion is really fast.
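
A hedged variant folding in that boundary check, searching one character past max_length so a space sitting exactly at the limit still counts as a split point (an illustrative sketch only; a single token longer than max_length would still need separate handling, just as in the answer's loop):

orig_string = "the quick brown fox jumps over the lazy dog"  # any input text
list_of_lines = []
max_length = 20
while len(orig_string) > max_length:
    # Include index max_length in the search window so a space at the boundary is found.
    line_length = orig_string[:max_length + 1].rfind(' ')
    list_of_lines.append(orig_string[:line_length])
    orig_string = orig_string[line_length + 1:]
list_of_lines.append(orig_string)
print(list_of_lines)  # ['the quick brown fox', 'jumps over the lazy', 'dog']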
2

This problem of chunking TTS inputs is more complicated than merely splitting over words, sentences, or even paragraphs. Depending on the language, and especially for English, modern neural TTS models benefit from having contextual information about the surrounding words, especially the preceding ones.

As such, instead of just splitting over words or sentences, a more correct way to split would be:

  1. First at paragraph boundaries.
  2. Then at sentence boundaries in case of paragraphs that are too large.
  3. Then at word boundaries in case of sentences that are too large.
  4. Lastly over characters in case of words that are too large.

After this, any consecutive splits that can be merged should be merged with appropriate separators. Overall, this keeps the chunks more intelligible for the TTS.
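
A rough, self-contained sketch of this tiered strategy (the split_recursive helper and its separator list are illustrative assumptions, not how the packages mentioned below implement it):

def split_recursive(text, limit, separators=("\n\n", "\n", " ", "")):
    """Split text into chunks of at most `limit` characters, preferring
    paragraph boundaries, then line boundaries, then spaces, and finally
    raw characters for single words that exceed the limit."""
    if len(text) <= limit:
        return [text] if text else []
    sep, *finer = separators
    if sep == "":
        # Last resort: hard-split an oversized word over characters.
        return [text[i:i + limit] for i in range(0, len(text), limit)]
    chunks, current = [], ""
    for piece in filter(None, text.split(sep)):
        candidate = f"{current}{sep}{piece}" if current else piece
        if len(candidate) <= limit:
            current = candidate  # merge consecutive pieces while they still fit
            continue
        if current:
            chunks.append(current)
        if len(piece) <= limit:
            current = piece  # start a fresh chunk with this piece
        else:
            # The piece is still too large at this level; recurse with finer separators.
            chunks.extend(split_recursive(piece, limit, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks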


Python packages like semantic-text-splitter and semchunk generalize this task of "semantic splitting". Here is a solution using semantic-text-splitter:

from semantic_text_splitter import TextSplitter

def semantic_split(text: str, limit: int) -> list[str]:
    """Return a list of chunks from the given text, splitting it at semantically sensible boundaries while applying the specified character length limit for each chunk."""
    # Ref: https://stackoverflow.com/a/78288960/
    splitter = TextSplitter(limit)
    chunks = splitter.chunks(text)
    return chunks

LIMIT = 50
TEXT = """“Well, Prince, so Genoa and Lucca are now just family estates of the Buonapartes. But I warn you, if you don’t tell me that this means war, if you still try to defend the infamies and horrors perpetrated by that Antichrist—I really believe he is Antichrist—I will have nothing more to do with you and you are no longer my friend, no longer my ‘faithful slave,’ as you call yourself! But how do you do? I see I have frightened you—sit down and tell me all the news.”"""
chunks = semantic_split(TEXT, LIMIT)

# Print chunks:
for num, chunk in enumerate(chunks, start=1):
    print({"#": num, "len": len(chunk), "chunk": chunk})

Output:

{'#': 1, 'len': 46, 'chunk': '“Well, Prince, so Genoa and Lucca are now just'}
{'#': 2, 'len': 34, 'chunk': 'family estates of the Buonapartes.'}
{'#': 3, 'len': 46, 'chunk': 'But I warn you, if you don’t tell me that this'}
{'#': 4, 'len': 50, 'chunk': 'means war, if you still try to defend the infamies'}
{'#': 5, 'len': 44, 'chunk': 'and horrors perpetrated by that Antichrist—I'}
{'#': 6, 'len': 43, 'chunk': 'really believe he is Antichrist—I will have'}
{'#': 7, 'len': 49, 'chunk': 'nothing more to do with you and you are no longer'}
{'#': 8, 'len': 48, 'chunk': 'my friend, no longer my ‘faithful slave,’ as you'}
{'#': 9, 'len': 33, 'chunk': 'call yourself! But how do you do?'}
{'#': 10, 'len': 48, 'chunk': 'I see I have frightened you—sit down and tell me'}
{'#': 11, 'len': 14, 'chunk': 'all the news.”'}


0

Building on Mark's answer: it looks like there's a small bug in the code when dealing with the end of the search; something like this might work:

def text_to_chunks(s, maxlength):
    start = 0
    end = 0
    while start + maxlength < len(s) and end != -1:
        end = s.rfind(" ", start, start + maxlength + 1)
        if end == -1:
            # No space found in the window; stop and yield the remainder below.
            break
        yield s[start:end]
        start = end + 1
    yield s[start:]
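
For example, with the "aa aa" test case raised in the comments on the accepted answer (an illustrative check, not part of the original answer):

s = "aa aa aa aa aa aa aa"
print(list(text_to_chunks(s, 5)))  # ['aa aa', 'aa aa', 'aa aa', 'aa']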


-2

You can use the nltk.tokenize methods as follows:

import nltk

corpus = '''
Well, Prince, so Genoa and Lucca are now just family estates of the Buonapartes. But I warn you, if you don’t tell me that this means war, if you still try to defend the infamies and horrors perpetrated by that Antichrist—I really believe he is Antichrist—I will have nothing more to do with you and you are no longer my friend, no longer my ‘faithful slave,’ as you call yourself! But how do you do? I see I have frightened you—sit down and tell me all the news.
'''

tokens = nltk.tokenize.word_tokenize(corpus)

or

sent_tokens = nltk.tokenize.sent_tokenize(corpus)

2 Comments

How does this create chunks of a certain size?
nltk.tokenize could be relevant as a low-level utility for building a splitter, but it doesn't by itself provide everything needed for the problem at hand.
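
For instance, sent_tokenize could feed a greedy packer that respects the character limit. A hypothetical sketch (the chunk_sentences name and the punkt download are assumptions, and sentences longer than the limit would still need a word-level fallback):

import nltk

nltk.download("punkt", quiet=True)  # sentence tokenizer models, needed once

def chunk_sentences(text, limit):
    """Greedily pack NLTK sentences into chunks of at most `limit` characters."""
    chunks, current = [], ""
    for sentence in nltk.tokenize.sent_tokenize(text):
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= limit:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks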
