post agentic · 2025-11-22 · 6 min read

RAG chunking strategies that actually work for technical content

#rag #llm #embeddings #vector-search #retrieval

The first RAG system I shipped used 512-token fixed-size chunks with 50-token overlap. It worked OK for the dataset I tested on (Wikipedia articles). It worked terribly for the dataset I deployed on (technical API documentation). The retriever returned chunks that started mid-sentence, ended mid-code-block, and split function definitions across three results. The LLM, given those chunks, responded with confident-sounding nonsense.

This post covers the six chunking strategies I’ve tried, the specific failures of each, and the hybrid approach I now use when retrieval quality is load-bearing.

Why chunking matters more than people think

The LLM only sees what the retriever returns. If the retriever returns garbage chunks, no amount of prompt engineering recovers. Chunking is the upstream decision that constrains everything downstream.

Two failure modes from bad chunking:

  1. Semantic fragmentation: a chunk contains the start of an explanation but not the finish. The LLM has half the answer; it confabulates the other half.
  2. Context loss: a chunk references “this function” or “the above example” but the antecedent isn’t in the chunk. The LLM either hallucinates the antecedent or hedges.

Both look like “the LLM is bad at this question” when actually the retriever fed it incomplete information.

Strategy 1: fixed-size chunks

The default. Split text into N-token chunks with M-token overlap.

def fixed_chunks(text: str, size: int = 512, overlap: int = 50) -> list[str]:
    # tokenize/detokenize are placeholders for whatever tokenizer you use
    # (e.g. tiktoken's encode/decode). Assumes overlap < size.
    tokens = tokenize(text)
    chunks = []
    for i in range(0, len(tokens), size - overlap):
        chunks.append(detokenize(tokens[i:i + size]))
    return chunks

Where it works: homogeneous, unstructured prose, like the Wikipedia-style dataset I first tested on.

Where it breaks: anything with internal structure. Chunks start mid-sentence, end mid-code-block, and split function definitions across results, which is exactly the API-docs failure from the intro.

Verdict: fine baseline for unstructured prose. Stop here if your data is exactly that.

Strategy 2: structural chunking on document syntax

Split on the document’s natural structure. For Markdown: split on ## and ### headings. For code: split on function/class boundaries. For HTML: split on <section> or <article> tags.

import re

def markdown_chunks_by_heading(text: str, level: int = 2) -> list[str]:
    # Start a new chunk whenever a heading of level 1..`level` begins a line.
    pattern = rf"^#{{1,{level}}}\s+.+$"
    chunks = []
    current = []
    for line in text.split("\n"):
        if re.match(pattern, line) and current:
            chunks.append("\n".join(current))
            current = [line]
        else:
            current.append(line)
    if current:
        chunks.append("\n".join(current))
    return chunks

Where it works: any document with real structure: Markdown with headings, source code with function and class boundaries, HTML with sectioning tags.

Where it breaks: sections that are far bigger or far smaller than your target chunk size. An oversized section still needs further splitting, a two-line section wastes a retrieval slot, which is why the verdict below adds size constraints. And it has nothing to split on when the document has no structure.

Verdict: dramatically better than fixed-size for any document with headings or code structure. Combine with size constraints (split big sections, merge tiny ones) for production.
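
The size constraints are mostly bookkeeping. Here is a minimal sketch of “split big sections, merge tiny ones” layered on markdown_chunks_by_heading, reusing fixed_chunks from Strategy 1 as the fallback splitter. constrained_structural_chunks, min_tokens, and max_tokens are names I made up for illustration, and token_count is the same placeholder token counter used in the later snippets.

def constrained_structural_chunks(text: str, min_tokens: int = 100,
                                  max_tokens: int = 800) -> list[str]:
    sections = markdown_chunks_by_heading(text)
    # Split big sections: anything over budget gets re-split.
    sized: list[str] = []
    for section in sections:
        if token_count(section) > max_tokens:
            sized.extend(fixed_chunks(section, size=max_tokens))
        else:
            sized.append(section)
    # Merge tiny ones: fold an undersized chunk together with the next piece,
    # as long as the combined chunk stays under budget.
    merged: list[str] = []
    for piece in sized:
        if (merged and token_count(merged[-1]) < min_tokens
                and token_count(merged[-1]) + token_count(piece) <= max_tokens):
            merged[-1] = merged[-1] + "\n" + piece
        else:
            merged.append(piece)
    return merged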

Strategy 3: recursive chunking

Try to split on coarse-grained boundaries first; if a piece is still too big, fall back to medium-grained, then fine-grained ones.

def recursive_chunks(text: str, max_tokens: int = 800) -> list[str]:
    # token_count and split_into_sentences are placeholder helpers.
    # First, try splitting on double newlines (paragraphs).
    pieces = text.split("\n\n")
    chunks = []
    for p in pieces:
        if token_count(p) <= max_tokens:
            chunks.append(p)
        else:
            # Paragraph is too big; fall back to sentence splitting.
            chunks.extend(split_into_sentences(p, max_tokens))
    return chunks

This is roughly the strategy LangChain’s RecursiveCharacterTextSplitter implements: it tries paragraphs first, then single newlines, then words, then raw characters.
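
For comparison, a minimal usage sketch of the LangChain splitter, assuming a recent langchain-text-splitters release (older versions expose the same class from langchain.text_splitter); document_text is a stand-in for your own input.

# pip install langchain-text-splitters
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,       # measured in characters by default, not tokens
    chunk_overlap=80,
    separators=["\n\n", "\n", " ", ""],  # paragraph, line, word, character fallbacks
)
chunks = splitter.split_text(document_text)

If you want token-based sizes rather than characters, there is a from_tiktoken_encoder constructor that swaps the length function; check the version you’re on before relying on it.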

Where it works: mixed or unknown document shapes where you don’t want to maintain format-specific splitting code.

Where it breaks: paragraph and sentence boundaries don’t respect code blocks, tables, or headings, so it still slices through structure that a structural splitter would keep intact.

Verdict: a reasonable default. Works better than fixed-size, doesn’t require document-shape-specific code.

Strategy 4: semantic chunking (embedding-based)

Compute embeddings sentence-by-sentence. Look at consecutive-sentence cosine similarity. When similarity drops below a threshold, that’s a topic boundary; cut there.

import numpy as np

def semantic_chunks(sentences: list[str], threshold: float = 0.5) -> list[str]:
    # embed_model is a placeholder for e.g. a sentence-transformers model.
    embeddings = embed_model.encode(sentences)
    chunks = []
    current = [sentences[0]]
    for i in range(1, len(sentences)):
        a, b = embeddings[i - 1], embeddings[i]
        sim = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
        if sim < threshold:
            # Similarity dropped below the threshold: topic boundary, cut here.
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    if current:
        chunks.append(" ".join(current))
    return chunks

Where it works: long prose with no headings to split on, where topic shifts are the only boundaries available.

Where it breaks: it costs an embedding call per sentence at index time, the threshold needs tuning per corpus, and it buys nothing over structural splitting when the document already has headings.

Verdict: useful for prose without structure. Skip for technical docs that already have headings.

Strategy 5: parent-child chunking

Two separate stores. The retriever queries small chunks (fast, precise matching). When a small chunk hits, return its larger parent (richer context for the LLM).

# Index time
for doc in documents:
    parents = chunk_by_section(doc)                  # ~2000-token chunks
    for parent_id, parent_text in parents:
        store_parent(parent_id, parent_text)
        children = chunk_by_paragraph(parent_text)   # ~200-token chunks
        for child_text in children:
            embedding = embed(child_text)
            vector_store.add(embedding, child_text, parent_id=parent_id)

# Query time
hits = vector_store.search(query_embedding, k=5)
parent_ids = {hit.parent_id for hit in hits}         # dedupe children sharing a parent
context = [load_parent(pid) for pid in parent_ids]

The match happens against small chunks (high precision), but the LLM gets the larger context (low fragmentation). Best of both.

Where it works: technical docs where the precise match lives in one small paragraph but a useful answer needs the whole surrounding section.

Where it breaks: it’s more machinery: two stores to keep consistent, parent IDs to track, and if you retrieve too many distinct parents the combined context gets large fast.

Verdict: my default for production RAG over technical docs. The complexity is worth it.

Strategy 6: late chunking

Embed the whole document, then derive chunk embeddings from the document-level encoding. Specifically, run the embedding model on the full text, then average-pool the per-token embeddings within each chunk.

This means every chunk’s embedding is informed by the entire document context. A chunk talking about “the function” has the function name encoded in it because the embedding saw the whole document.

# Pseudo-code; needs an embedding model that exposes per-token outputs
full_text = doc.text
chunks = recursive_chunks(full_text, max_tokens=800)
chunk_offsets = compute_offsets(chunks, full_text)   # (start, end) token indices per chunk
# Get per-token embeddings for the entire document
token_embeddings = embed_model.encode(full_text, return_token_embeddings=True)
# Pool per-token embeddings into per-chunk embeddings
chunk_embeddings = []
for start, end in chunk_offsets:
    chunk_emb = token_embeddings[start:end].mean(dim=0)
    chunk_embeddings.append(chunk_emb)

This is a 2024 technique (Jina AI proposed it). It’s measurably better than naive embedding for documents where context outside the chunk matters.

Where it works: documents full of cross-references and pronouns, where a chunk on its own is ambiguous but the surrounding document resolves it.

Where it breaks: the whole document has to fit in the embedding model’s context window, the model has to expose per-token outputs, and most off-the-shelf indexing pipelines won’t do the pooling step for you.

Verdict: cutting-edge, worth experimenting with for top-tier retrieval quality. Not yet a default.

What I actually do for technical docs in 2025

Hybrid: structural primary + recursive fallback + parent-child for context. The steps (a code sketch follows the list):

1. Try structural split (markdown headings, code blocks, function defs).
2. For any structural chunk over max_tokens, recursive-split.
3. Keep both the small structural/recursive chunks (children) and the larger section-level chunks (parents).
4. Embed the children, store with parent_id.
5. At query time: retrieve top-K children, dedupe by parent_id, return parent texts to the LLM.
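
Putting the steps together, a minimal sketch built from the placeholders already used in this post (markdown_chunks_by_heading, recursive_chunks, token_count, embed, the parent store, and the vector store); index_hybrid and retrieve_context are names I made up for illustration.

def index_hybrid(doc_text: str, max_tokens: int = 800) -> None:
    # Steps 1-4: structural split, recursive fallback for oversized sections,
    # keep the section as the parent and the small pieces as children.
    for parent_id, parent_text in enumerate(markdown_chunks_by_heading(doc_text)):
        store_parent(parent_id, parent_text)
        children = ([parent_text] if token_count(parent_text) <= max_tokens
                    else recursive_chunks(parent_text, max_tokens))
        for child_text in children:
            vector_store.add(embed(child_text), child_text, parent_id=parent_id)

def retrieve_context(query: str, k: int = 5) -> list[str]:
    # Step 5: match against children, return deduped parents.
    hits = vector_store.search(embed(query), k=k)
    parent_ids = {hit.parent_id for hit in hits}
    return [load_parent(pid) for pid in parent_ids]

Parents here are section-level chunks, matching the ~2000-token parents from Strategy 5.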

Why it works in practice: the structural split keeps code blocks and explanations intact, the recursive fallback keeps chunk sizes bounded, and the parent-child step gives the retriever precise matches while the LLM still sees the whole section.

For a 200-page technical doc, this strategy gets me retrieval-quality numbers (NDCG@5, mean-reciprocal-rank) ~25% higher than naive recursive chunking.

Things I no longer do

  1. Ship the 512-token fixed-size default without testing it on the corpus I’m actually deploying on.
  2. Reach for a different embedding model or a re-ranker before fixing the chunking underneath.
  3. Run embedding-based semantic chunking on docs that already have headings.
  4. Change a chunking strategy without an eval set that can tell me whether it helped.

How to measure if your chunking is working

You can’t fix what you can’t measure. Build an eval set of 50–100 (question, ideal-source-passage) pairs from real users. Then measure:

  1. Recall@K: is the ideal passage (or its parent chunk) anywhere in the top-K results?
  2. Rank quality: mean reciprocal rank and NDCG@5, the same numbers quoted above.

Improve chunking, re-run the eval, compare. Without this loop, you’re guessing.
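
A minimal version of that loop, reusing the retrieve_context sketch from the hybrid section; eval_chunking is a name I made up, and it counts a retrieval as a hit when the ideal passage appears inside a returned parent, which is a blunt proxy but catches regressions.

def eval_chunking(eval_set: list[tuple[str, str]], k: int = 5) -> dict[str, float]:
    hits, reciprocal_ranks = 0, []
    for question, ideal_passage in eval_set:
        results = retrieve_context(question, k=k)
        # A result counts as relevant if it contains the ideal passage.
        ranks = [i for i, r in enumerate(results, start=1) if ideal_passage in r]
        if ranks:
            hits += 1
            reciprocal_ranks.append(1.0 / ranks[0])
        else:
            reciprocal_ranks.append(0.0)
    return {
        "recall@k": hits / len(eval_set),
        "mrr": sum(reciprocal_ranks) / len(eval_set),
    }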

Closing

Chunking isn’t a hyperparameter; it’s a design decision. The strategy that fits your data shape moves retrieval quality more than swapping embedding models, more than re-ranking, more than prompt iteration. For technical content: structural primary, recursive fallback, parent-child for context. For unstructured prose: recursive splitter is fine. Measure with a small eval set. Tune the system, not one knob in isolation.