Splitting input text into fixed-size overlapping word chunks

Question

I’ve implemented a small utility function in Python 3.11 that takes an input string, splits it into word-based chunks of a given size, and allows a specified overlap between consecutive chunks. This is part of a text-processing pipeline where I need to feed manageable, context-aware slices into downstream NLP tasks.

What it should do:

Break a long string into chunks of up to chunk_size words
Ensure each chunk (except the first) starts overlap words before the end of the previous chunk
Return a list of strings, each containing up to chunk_size words
Handle edge cases like empty input, exact fits, zero overlap, and invalid parameters

Environment & background

Language: Python 3.11
Context: Preprocessing for an LLM embedding service (chunks feed sequentially)
Experience level: Intermediate Python
Not a homework or interview question—just looking for best practices and bug-checks

Review focus

Correctness across all edge cases (empty text, overlap >= chunk_size, final shorter chunk)
Pythonic style, readability, and idiomatic use of standard library
Efficiency for very large inputs
Clear error handling and type annotations
Suggestions for alternative approaches (e.g., itertools, deque, third-party helpers)

Code

def chunk_text(
    text: str,
    chunk_size: int = 500,
    overlap: int = 50
) -> list[str]:
    """
    Split `text` into word-based chunks of size `chunk_size`, with
    `overlap` words repeated between consecutive chunks.

    Args:
        text: Full input text.
        chunk_size: Number of words per chunk.
        overlap: Number of overlapping words between chunks.
    Returns:
        List of text chunks.
    """
    words = text.split()
    total = len(words)
    if total <= chunk_size:
        return [text]

    # Must advance start by chunk_size - overlap
    step = chunk_size - overlap
    if step <= 0:
        raise ValueError("Overlap must be smaller than chunk size")

    chunks: list[str] = []
    for start in range(0, total, step):
        end = start + chunk_size
        chunk_words = words[start:end]
        if not chunk_words:
            break

        chunks.append(' '.join(chunk_words))
        if end >= total:
            break

    return chunks

Unit tests (pytest)

I’ve also written a suite to validate (some) behaviour:

import pytest

from utils import chunk_text


def make_text(n: int) -> str:
    return " ".join(f"w{i}" for i in range(n))


def test_short_text_returns_single_chunk():
    assert chunk_text(make_text(5), chunk_size=10, overlap=3) == [make_text(5)]


def test_exact_chunk_size_returns_single_chunk():
    assert chunk_text(make_text(10), chunk_size=10, overlap=3) == [make_text(10)]


def test_multiple_chunks_with_overlap_and_final_smaller_chunk():
    chunks = chunk_text(make_text(100), chunk_size=50, overlap=10)
    # Expected: [0:50], [40:90], [80:100]
    assert len(chunks) == 3
    assert [len(c.split()) for c in chunks] == [50, 50, 20]
    assert chunks[0].split()[0] == "w0" and chunks[0].split()[-1] == "w49"
    assert chunks[1].split()[0] == "w40" and chunks[1].split()[-1] == "w89"
    assert chunks[2].split()[0] == "w80" and chunks[2].split()[-1] == "w99"


def test_empty_text_returns_empty_string_chunk():
    assert chunk_text("", chunk_size=10, overlap=2) == [""]


def test_zero_overlap_behaviour():
    chunks = chunk_text(make_text(12), chunk_size=5, overlap=0)
    assert [len(c.split()) for c in chunks] == [5, 5, 2]


if __name__ == "__main__":
    pytest.main()

Reinderien · Accepted Answer · 2025-06-14 18:26:20Z

Your tests are pretty good, and helped me during refactoring. I think they all make sense save the empty case: in my refactor an empty string generates no chunks rather than one empty chunk.

I can't complain about much. What I'll show is an alternative implementation that is fully lazy. This is unlikely to help performance for small input. For large input, especially if the outer consumer is able to early-terminate, it may help reduce memory consumption and execution time as compared to your original implementation. More micro-optimisation is possible that I don't show, such as caching partial string joins.

import itertools
import re
import typing

import pytest


# Match on each word having no whitespace
WORD = re.compile(r'\S+')


def chunk_text(
    text: str,
    chunk_size: int = 500,
    overlap: int = 50
) -> typing.Iterator[str]:
    """
    Split `text` into word-based chunks of size `chunk_size`, with
    `overlap` words repeated between consecutive chunks.

    Args:
        text: Full input text.
        chunk_size: Number of words per chunk.
        overlap: Number of overlapping words between chunks.
    Returns:
        Iterator of text chunks.
    """
    words = (match[0] for match in WORD.finditer(text))
    queue = []

    while True:
        n_old = len(queue)
        queue.extend(itertools.islice(words, chunk_size - n_old))

        n_new = len(queue)
        if n_old == n_new:
            break   # if no new words were read

        # Some new words were read; yield this chunk even if it's short
        yield ' '.join(queue)

        # Leave only the end overlap for the next loop
        del queue[:n_new - overlap]


def make_text(n: int) -> str:
    return " ".join(f"w{i}" for i in range(n))


def test_empty_text_returns_empty_string_chunk():
    # BEHAVIOUR CHANGE:
    # An empty string has _no_ chunks, so this should be the empty list
    assert list(chunk_text("", chunk_size=10, overlap=2)) == []


def test_short_text_returns_single_chunk():
    assert list(chunk_text(make_text(5), chunk_size=10, overlap=3)) == [make_text(5)]


def test_zero_overlap_behaviour():
    chunks = list(chunk_text(make_text(12), chunk_size=5, overlap=0))
    assert [len(c.split()) for c in chunks] == [5, 5, 2]


def test_exact_chunk_size_returns_single_chunk():
    assert list(chunk_text(make_text(10), chunk_size=10, overlap=3)) == [make_text(10)]


def test_multiple_chunks_with_overlap_and_final_smaller_chunk():
    chunks = list(chunk_text(make_text(100), chunk_size=50, overlap=10))
    # Expected: [0:50], [40:90], [80:100]
    assert len(chunks) == 3
    assert [len(c.split()) for c in chunks] == [50, 50, 20]
    assert chunks[0].split()[0] == "w0" and chunks[0].split()[-1] == "w49"
    assert chunks[1].split()[0] == "w40" and chunks[1].split()[-1] == "w89"
    assert chunks[2].split()[0] == "w80" and chunks[2].split()[-1] == "w99"


if __name__ == "__main__":
    pytest.main()

============================= test session starts =============================
collecting ... collected 5 items

297337.py::test_empty_text_returns_empty_string_chunk PASSED             [ 20%]
297337.py::test_short_text_returns_single_chunk PASSED                   [ 40%]
297337.py::test_zero_overlap_behaviour PASSED                            [ 60%]
297337.py::test_exact_chunk_size_returns_single_chunk PASSED             [ 80%]
297337.py::test_multiple_chunks_with_overlap_and_final_smaller_chunk PASSED [100%]

============================== 5 passed in 0.03s ==============================
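The early-termination point made above can be illustrated with a short sketch. It repeats the lazy generator in condensed form so that the snippet runs on its own; the input size and the decision to take only two chunks are arbitrary choices made for illustration.

```python
import itertools
import re
import typing

WORD = re.compile(r'\S+')


def chunk_text(
    text: str,
    chunk_size: int = 500,
    overlap: int = 50
) -> typing.Iterator[str]:
    # Condensed copy of the lazy generator shown above
    words = (match[0] for match in WORD.finditer(text))
    queue: list[str] = []
    while True:
        n_old = len(queue)
        queue.extend(itertools.islice(words, chunk_size - n_old))
        n_new = len(queue)
        if n_old == n_new:
            break
        yield ' '.join(queue)
        del queue[:n_new - overlap]


# Illustrative input: 100,000 words (size chosen arbitrarily)
big_text = " ".join(f"w{i}" for i in range(100_000))

# Only two chunks are requested, so the regex scan stops once the
# second chunk is full instead of tokenising the whole string.
first_two = list(itertools.islice(chunk_text(big_text, chunk_size=50, overlap=10), 2))
print([len(chunk.split()) for chunk in first_two])
```

A consumer that needs every chunk gains nothing in running time from this, but it still avoids holding the full word list and the full chunk list in memory at once.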

Using a deque instead of a list is also possible, but is not a clear win: a deque does not support slice deletion, so the left-pops need to be done one at a time instead of in a single del over a slice:

import collections

# Reuses itertools, typing and WORD from the first listing


def chunk_text(
    text: str,
    chunk_size: int = 500,
    overlap: int = 50
) -> typing.Iterator[str]:
    """
    Split `text` into word-based chunks of size `chunk_size`, with
    `overlap` words repeated between consecutive chunks.

    Args:
        text: Full input text.
        chunk_size: Number of words per chunk.
        overlap: Number of overlapping words between chunks.
    Returns:
        Iterator of text chunks.
    """
    words = (match[0] for match in WORD.finditer(text))
    queue = collections.deque(maxlen=chunk_size)

    while True:
        n_old = len(queue)
        queue.extend(itertools.islice(words, chunk_size - n_old))

        n_new = len(queue)
        if n_old == n_new:
            break   # if no new words were read

        # Some new words were read; yield this chunk even if it's short
        yield ' '.join(queue)

        # Leave only the end overlap for the next loop
        for _ in range(n_new - overlap):
            queue.popleft()
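Neither lazy version above carries over the ValueError check from the question. With overlap equal to chunk_size, for instance, both loops stop silently after the first chunk because there is no room left to read new words. What follows is a sketch of one way to restore the check, not a definitive fix. A generator body does not run until it is first iterated, so a check placed inside it would only raise on the first next(); here a plain wrapper validates at call time and hands off to an inner generator. The name _iter_chunks is hypothetical, and treating chunk_size < 1 and negative overlap as errors is an assumption.

```python
import itertools
import re
import typing

WORD = re.compile(r'\S+')


def chunk_text(
    text: str,
    chunk_size: int = 500,
    overlap: int = 50
) -> typing.Iterator[str]:
    # A plain function (no yield), so these checks run at call time
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    return _iter_chunks(text, chunk_size, overlap)


def _iter_chunks(text: str, chunk_size: int, overlap: int) -> typing.Iterator[str]:
    # Hypothetical helper: same loop as the list-based generator above
    words = (match[0] for match in WORD.finditer(text))
    queue: list[str] = []
    while True:
        n_old = len(queue)
        queue.extend(itertools.islice(words, chunk_size - n_old))
        n_new = len(queue)
        if n_old == n_new:
            break   # no new words were read
        yield ' '.join(queue)
        del queue[:n_new - overlap]
```

Callers see the error as soon as they call chunk_text with bad parameters, even if they never iterate the result.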

Booboo · Accepted Answer · 2025-06-12 21:16:27Z

Remove Unnecessary Checks

In the body of your for loop, the check "if not chunk_words: break" seems unnecessary, because chunk_words must contain at least one word¹.

Footnotes

¹ The Proof

1. The fact that the loop body is still executing implies that 0 <= start < total.
2. Since chunk_size must be at least 1, start + chunk_size > start. Because this expression is assigned to end, end > start.
3. Finally, since total is the length of words and because of assertion #1 above, the slice words[start:end] must have length >= 1.
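The argument can also be checked empirically. The sketch below replays the start/end arithmetic of the question's for loop over a small grid of parameters and asserts that every slice is non-empty. The helper name loop_slices and the parameter ranges are made up for illustration, and a brute-force check like this supports the proof rather than replacing it.

```python
def loop_slices(total: int, chunk_size: int, overlap: int) -> list[tuple[int, int]]:
    """Return the (start, end) pairs visited by the question's for loop."""
    step = chunk_size - overlap
    visited = []
    for start in range(0, total, step):
        end = start + chunk_size
        visited.append((start, end))
        if end >= total:
            break
    return visited


for chunk_size in range(1, 9):
    for overlap in range(chunk_size):              # keeps step > 0
        # The loop is only reached when total > chunk_size
        for total in range(chunk_size + 1, 60):
            words = list(range(total))
            for start, end in loop_slices(total, chunk_size, overlap):
                # The slice that the removed check would have guarded
                assert words[start:end], (total, chunk_size, overlap, start)
```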

toolic · Accepted Answer · 2025-06-11 11:24:23Z

Here are some minor suggestions.

Documentation

The docstring for the function is very well-structured, with the inputs and return values clearly described. The type hints are excellent.

Consider explicitly stating that words are separated by whitespace and that multiple consecutive spaces (or tabs, etc.) will be collapsed into a single space in the returned chunks.

I am a little unclear on the meaning of overlap. Perhaps a small example in the docstring would help. Is an overlap of 0 valid, and does it mean there is no overlap?
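One possible shape for such a docstring is sketched below. The wording and the doctest examples are only a suggestion. The body is a condensed variant rather than the question's exact code: it always joins the split words, so whitespace is collapsed for short inputs too, whereas the question's early return [text] hands back the input unchanged. It also returns an empty list for empty input.

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """
    Split `text` into chunks of at most `chunk_size` words.

    Words are runs of non-whitespace characters. Any run of whitespace
    between words becomes a single space in the returned chunks.

    `overlap` is the number of words at the end of one chunk that are
    repeated at the start of the next; 0 means nothing is repeated.

    >>> chunk_text("a  b c   d e", chunk_size=3, overlap=1)
    ['a b c', 'c d e']
    >>> chunk_text("a  b c   d e", chunk_size=2, overlap=0)
    ['a b', 'c d', 'e']
    """
    words = text.split()
    step = chunk_size - overlap
    if step <= 0:
        raise ValueError("Overlap must be smaller than chunk size")

    chunks: list[str] = []
    for start in range(0, len(words), step):
        end = start + chunk_size
        chunks.append(' '.join(words[start:end]))
        if end >= len(words):
            break
    return chunks


if __name__ == "__main__":
    import doctest
    doctest.testmod()
```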

Naming

The variable named total is a bit vague. I think total_words is more specific.

Testing

If you are not already doing so, I think adding a variety of whitespace to the input string would be worth testing. For example, add a mixture of multiple consecutive spaces, tabs and newlines throughout the string.

Add a test with leading and trailing whitespace in the input string.

Having a string where words are relatively sparse as compared to the whitespace might be an interesting case.

If punctuation characters are allowed in the input, create a test with a lot of those as well: ;:'",.
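A sketch of how these cases might look as pytest-style test functions. The chunk_text here is a condensed stand-in so that the snippet runs on its own; in practice the tests would import the real function. The input strings are invented for illustration.

```python
def chunk_text(text: str, chunk_size: int, overlap: int) -> list[str]:
    # Condensed stand-in for the function under test
    words = text.split()
    step = chunk_size - overlap
    if step <= 0:
        raise ValueError("Overlap must be smaller than chunk size")
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks


def test_mixed_whitespace_is_collapsed():
    text = "a  b\tc\n\nd \t e"
    assert chunk_text(text, chunk_size=2, overlap=0) == ["a b", "c d", "e"]


def test_leading_and_trailing_whitespace_is_dropped():
    assert chunk_text("  a b c \n", chunk_size=2, overlap=0) == ["a b", "c"]


def test_sparse_words_among_long_whitespace_runs():
    text = "a" + " " * 1000 + "b" + "\n" * 1000 + "c"
    assert chunk_text(text, chunk_size=2, overlap=1) == ["a b", "b c"]


def test_punctuation_stays_attached_to_words():
    text = "Hello, world; it's: \"fine\"."
    expected = ["Hello, world; it's:", "it's: \"fine\"."]
    assert chunk_text(text, chunk_size=3, overlap=1) == expected
```

Each input has more words than chunk_size, so the expectations do not depend on how the single-chunk shortcut treats whitespace.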
