I’ve implemented a small utility function in Python 3.11 that takes an input string, splits it into word-based chunks of a given size, and allows a specified overlap between consecutive chunks. This is part of a text-processing pipeline where I need to feed manageable, context-aware slices into downstream NLP tasks.
What it should do:
- Break a long string into chunks of up to
chunk_sizewords - Ensure each chunk (except the first) starts
overlapwords before the end of the previous chunk - Return a list of strings, each containing up to
chunk_sizewords - Handle edge cases like empty input, exact fits, zero overlap, and invalid parameters
Environment & background
- Language: Python 3.11
- Context: Preprocessing for an LLM embedding service (chunks feed sequentially)
- Experience level: Intermediate Python
- Not a homework or interview question—just looking for best practices and bug-checks
Review focus
- Correctness across all edge cases (empty text, overlap >= chunk_size, final shorter chunk)
- Pythonic style, readability, and idiomatic use of standard library
- Efficiency for very large inputs
- Clear error handling and type annotations
- Suggestions for alternative approaches (e.g., itertools, deque, third-party helpers)
Code
def chunk_text(
text: str,
chunk_size: int = 500,
overlap: int = 50
) -> list[str]:
"""
Split `text` into word-based chunks of size `chunk_size`, with
`overlap` words repeated between consecutive chunks.
Args:
text: Full input text.
chunk_size: Number of words per chunk.
overlap: Number of overlapping words between chunks.
Returns:
List of text chunks.
"""
words = text.split()
total = len(words)
if total <= chunk_size:
return [text]
# Must advance start by chunk_size - overlap
step = chunk_size - overlap
if step <= 0:
raise ValueError("Overlap must be smaller than chunk size")
chunks: list[str] = []
for start in range(0, total, step):
end = start + chunk_size
chunk_words = words[start:end]
if not chunk_words:
break
chunks.append(' '.join(chunk_words))
if end >= total:
break
return chunks
Unit tests (pytest) I’ve also written a suite to validate (some) behaviour:
import pytest
from utils import chunk_text
def make_text(n: int) -> str:
return " ".join(f"w{i}" for i in range(n))
def test_short_text_returns_single_chunk():
assert chunk_text(make_text(5), chunk_size=10, overlap=3) == [make_text(5)]
def test_exact_chunk_size_returns_single_chunk():
assert chunk_text(make_text(10), chunk_size=10, overlap=3) == [make_text(10)]
def test_multiple_chunks_with_overlap_and_final_smaller_chunk():
chunks = chunk_text(make_text(100), chunk_size=50, overlap=10)
# Expected: [0:50], [40:90], [80:100]
assert len(chunks) == 3
assert [len(c.split()) for c in chunks] == [50, 50, 20]
assert chunks[0].split()[0] == "w0" and chunks[0].split()[-1] == "w49"
assert chunks[1].split()[0] == "w40" and chunks[1].split()[-1] == "w89"
assert chunks[2].split()[0] == "w80" and chunks[2].split()[-1] == "w99"
def test_empty_text_returns_empty_string_chunk():
assert chunk_text("", chunk_size=10, overlap=2) == [""]
def test_zero_overlap_behaviour():
chunks = chunk_text(make_text(12), chunk_size=5, overlap=0)
assert [len(c.split()) for c in chunks] == [5, 5, 2]
if __name__ == "__main__":
pytest.main()