post agentic · 2025-11-22 · 6 min read

RAG chunking strategies that actually work for technical content

#rag #llm #embeddings #vector-search #retrieval

The first RAG system I shipped used 512-token fixed-size chunks with 50-token overlap. It worked OK for the dataset I tested on (Wikipedia articles). It worked terribly for the dataset I deployed on (technical API documentation). The retriever returned chunks that started mid-sentence, ended mid-code-block, and split function definitions across three results. The LLM, given those chunks, responded with confident-sounding nonsense.

This post covers the six chunking strategies I’ve tried, the specific failures of each, and the hybrid approach I now use when retrieval quality is load-bearing.

Why chunking matters more than people think

The LLM only sees what the retriever returns. If the retriever returns garbage chunks, no amount of prompt engineering recovers. Chunking is the upstream decision that constrains everything downstream.

Two failure modes from bad chunking:

  1. Semantic fragmentation: a chunk contains the start of an explanation but not the finish. The LLM has half the answer; it confabulates the other half.
  2. Context loss: a chunk references “this function” or “the above example” but the antecedent isn’t in the chunk. The LLM either hallucinates the antecedent or hedges.

Both look like “the LLM is bad at this question” when actually the retriever fed it incomplete information.

Strategy 1: fixed-size chunks

The default. Split text into N-token chunks with M-token overlap.

def fixed_chunks(text: str, size: int = 512, overlap: int = 50) -> list[str]:
    # tokenize/detokenize are placeholders for whatever tokenizer you use
    # (e.g. tiktoken's encode/decode). Assumes overlap < size.
    tokens = tokenize(text)
    chunks = []
    for i in range(0, len(tokens), size - overlap):
        chunks.append(detokenize(tokens[i:i + size]))
    return chunks

Where it works: homogeneous, unstructured prose, like the Wikipedia-style dataset I first tested on.

Where it breaks: anything with internal structure. Chunks start mid-sentence, end mid-code-block, and split function definitions across results, which is exactly the API-docs failure from the intro.

Verdict: fine baseline for unstructured prose. Stop here if your data is exactly that.

Strategy 2: structural chunking on document syntax

Split on the document’s natural structure. For Markdown: split on ## and ### headings. For code: split on function/class boundaries. For HTML: split on <section> or <article> tags.

import re

def markdown_chunks_by_heading(text: str, level: int = 2) -> list[str]:
    # Start a new chunk whenever a heading of level 1..`level` begins a line.
    pattern = rf"^#{{1,{level}}}\s+.+$"
    chunks = []
    current = []
    for line in text.split("\n"):
        if re.match(pattern, line) and current:
            chunks.append("\n".join(current))
            current = [line]
        else:
            current.append(line)
    if current:
        chunks.append("\n".join(current))
    return chunks

Where it works: any document with real structure: Markdown with headings, source code with function and class boundaries, HTML with sectioning tags.

Where it breaks: sections that are far bigger or far smaller than your target chunk size. An oversized section still needs further splitting, a two-line section wastes a retrieval slot, which is why the verdict below adds size constraints. And it has nothing to split on when the document has no structure.

Verdict: dramatically better than fixed-size for any document with headings or code structure. Combine with size constraints (split big sections, merge tiny ones) for production.
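
The size constraints are mostly bookkeeping. Here is a minimal sketch of “split big sections, merge tiny ones” layered on markdown_chunks_by_heading, reusing fixed_chunks from Strategy 1 as the fallback splitter. constrained_structural_chunks, min_tokens, and max_tokens are names I made up for illustration, and token_count is the same placeholder token counter used in the later snippets.

def constrained_structural_chunks(text: str, min_tokens: int = 100,
                                  max_tokens: int = 800) -> list[str]:
    sections = markdown_chunks_by_heading(text)
    # Split big sections: anything over budget gets re-split.
    sized: list[str] = []
    for section in sections:
        if token_count(section) > max_tokens:
            sized.extend(fixed_chunks(section, size=max_tokens))
        else:
            sized.append(section)
    # Merge tiny ones: fold an undersized chunk together with the next piece,
    # as long as the combined chunk stays under budget.
    merged: list[str] = []
    for piece in sized:
        if (merged and token_count(merged[-1]) < min_tokens
                and token_count(merged[-1]) + token_count(piece) <= max_tokens):
            merged[-1] = merged[-1] + "\n" + piece
        else:
            merged.append(piece)
    return merged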

Strategy 3: recursive chunking

Try to split on coarse-grained boundaries first; if a piece is still too big, fall back to medium-grained, then fine-grained ones.

def recursive_chunks(text: str, max_tokens: int = 800) -> list[str]:
    # token_count and split_into_sentences are placeholder helpers.
    # First, try splitting on double newlines (paragraphs).
    pieces = text.split("\n\n")
    chunks = []
    for p in pieces:
        if token_count(p) <= max_tokens:
            chunks.append(p)
        else:
            # Paragraph is too big; fall back to sentence splitting.
            chunks.extend(split_into_sentences(p, max_tokens))
    return chunks

This is roughly the strategy LangChain’s RecursiveCharacterTextSplitter implements: it tries paragraphs first, then single newlines, then words, then raw characters.
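
For comparison, a minimal usage sketch of the LangChain splitter, assuming a recent langchain-text-splitters release (older versions expose the same class from langchain.text_splitter); document_text is a stand-in for your own input.

# pip install langchain-text-splitters
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,       # measured in characters by default, not tokens
    chunk_overlap=80,
    separators=["\n\n", "\n", " ", ""],  # paragraph, line, word, character fallbacks
)
chunks = splitter.split_text(document_text)

If you want token-based sizes rather than characters, there is a from_tiktoken_encoder constructor that swaps the length function; check the version you’re on before relying on it.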

Where it works: mixed or unknown document shapes where you don’t want to maintain format-specific splitting code.

Where it breaks: paragraph and sentence boundaries don’t respect code blocks, tables, or headings, so it still slices through structure that a structural splitter would keep intact.

Verdict: a reasonable default. Works better than fixed-size, doesn’t require document-shape-specific code.

Strategy 4: semantic chunking (embedding-based)

Compute embeddings sentence-by-sentence. Look at consecutive-sentence cosine similarity. When similarity drops below a threshold, that’s a topic boundary; cut there.

import numpy as np

def semantic_chunks(sentences: list[str], threshold: float = 0.5) -> list[str]:
    # embed_model is a placeholder for e.g. a sentence-transformers model.
    embeddings = embed_model.encode(sentences)
    chunks = []
    current = [sentences[0]]
    for i in range(1, len(sentences)):
        a, b = embeddings[i - 1], embeddings[i]
        sim = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
        if sim < threshold:
            # Similarity dropped below the threshold: topic boundary, cut here.
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    if current:
        chunks.append(" ".join(current))
    return chunks

Where it works: long prose with no headings to split on, where topic shifts are the only boundaries available.

Where it breaks: it costs an embedding call per sentence at index time, the threshold needs tuning per corpus, and it buys nothing over structural splitting when the document already has headings.

Verdict: useful for prose without structure. Skip for technical docs that already have headings.

Strategy 5: parent-child chunking

Two separate stores. The retriever queries small chunks (fast, precise matching). When a small chunk hits, return its larger parent (richer context for the LLM).

# Index time
for doc in documents:
    parents = chunk_by_section(doc)                  # ~2000-token chunks
    for parent_id, parent_text in parents:
        store_parent(parent_id, parent_text)
        children = chunk_by_paragraph(parent_text)   # ~200-token chunks
        for child_text in children:
            embedding = embed(child_text)
            vector_store.add(embedding, child_text, parent_id=parent_id)

# Query time
hits = vector_store.search(query_embedding, k=5)
parent_ids = {hit.parent_id for hit in hits}         # dedupe children sharing a parent
context = [load_parent(pid) for pid in parent_ids]

The match happens against small chunks (high precision), but the LLM gets the larger context (low fragmentation). Best of both.

Where it works: technical docs where the precise match lives in one small paragraph but a useful answer needs the whole surrounding section.

Where it breaks: it’s more machinery: two stores to keep consistent, parent IDs to track, and if you retrieve too many distinct parents the combined context gets large fast.

Verdict: my default for production RAG over technical docs. The complexity is worth it.

Strategy 6: late chunking

Embed the whole document, then derive chunk embeddings from the document-level encoding. Specifically, run the embedding model on the full text, then average-pool the per-token embeddings within each chunk.

This means every chunk’s embedding is informed by the entire document context. A chunk talking about “the function” has the function name encoded in it because the embedding saw the whole document.

# Pseudo-code; needs an embedding model that exposes per-token outputs
full_text = doc.text
chunks = recursive_chunks(full_text, max_tokens=800)
chunk_offsets = compute_offsets(chunks, full_text)   # (start, end) token indices per chunk
# Get per-token embeddings for the entire document
token_embeddings = embed_model.encode(full_text, return_token_embeddings=True)
# Pool per-token embeddings into per-chunk embeddings
chunk_embeddings = []
for start, end in chunk_offsets:
    chunk_emb = token_embeddings[start:end].mean(dim=0)
    chunk_embeddings.append(chunk_emb)

This is a 2024 technique (Jina AI proposed it). It’s measurably better than naive embedding for documents where context outside the chunk matters.

Where it works: documents full of cross-references and pronouns, where a chunk on its own is ambiguous but the surrounding document resolves it.

Where it breaks: the whole document has to fit in the embedding model’s context window, the model has to expose per-token outputs, and most off-the-shelf indexing pipelines won’t do the pooling step for you.

Verdict: cutting-edge, worth experimenting with for top-tier retrieval quality. Not yet a default.

What I actually do for technical docs in 2025

Hybrid: structural primary + recursive fallback + parent-child for context. The steps (a code sketch follows the list):

1. Try structural split (markdown headings, code blocks, function defs).
2. For any structural chunk over max_tokens, recursive-split.
3. Keep both the small structural/recursive chunks (children) and the larger section-level chunks (parents).
4. Embed the children, store with parent_id.
5. At query time: retrieve top-K children, dedupe by parent_id, return parent texts to the LLM.
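
Putting the steps together, a minimal sketch built from the placeholders already used in this post (markdown_chunks_by_heading, recursive_chunks, token_count, embed, the parent store, and the vector store); index_hybrid and retrieve_context are names I made up for illustration.

def index_hybrid(doc_text: str, max_tokens: int = 800) -> None:
    # Steps 1-4: structural split, recursive fallback for oversized sections,
    # keep the section as the parent and the small pieces as children.
    for parent_id, parent_text in enumerate(markdown_chunks_by_heading(doc_text)):
        store_parent(parent_id, parent_text)
        children = ([parent_text] if token_count(parent_text) <= max_tokens
                    else recursive_chunks(parent_text, max_tokens))
        for child_text in children:
            vector_store.add(embed(child_text), child_text, parent_id=parent_id)

def retrieve_context(query: str, k: int = 5) -> list[str]:
    # Step 5: match against children, return deduped parents.
    hits = vector_store.search(embed(query), k=k)
    parent_ids = {hit.parent_id for hit in hits}
    return [load_parent(pid) for pid in parent_ids]

Parents here are section-level chunks, matching the ~2000-token parents from Strategy 5.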

Why it works in practice: the structural split keeps code blocks and explanations intact, the recursive fallback keeps chunk sizes bounded, and the parent-child step gives the retriever precise matches while the LLM still sees the whole section.

For a 200-page technical doc, this strategy gets me retrieval-quality numbers (NDCG@5, mean-reciprocal-rank) ~25% higher than naive recursive chunking.

Things I no longer do

  1. Ship the 512-token fixed-size default without testing it on the corpus I’m actually deploying on.
  2. Reach for a different embedding model or a re-ranker before fixing the chunking underneath.
  3. Run embedding-based semantic chunking on docs that already have headings.
  4. Change a chunking strategy without an eval set that can tell me whether it helped.

How to measure if your chunking is working

You can’t fix what you can’t measure. Build an eval set of 50–100 (question, ideal-source-passage) pairs from real users. Then measure:

  1. Recall@K: is the ideal passage (or its parent chunk) anywhere in the top-K results?
  2. Rank quality: mean reciprocal rank and NDCG@5, the same numbers quoted above.

Improve chunking, re-run the eval, compare. Without this loop, you’re guessing.
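
A minimal version of that loop, reusing the retrieve_context sketch from the hybrid section; eval_chunking is a name I made up, and it counts a retrieval as a hit when the ideal passage appears inside a returned parent, which is a blunt proxy but catches regressions.

def eval_chunking(eval_set: list[tuple[str, str]], k: int = 5) -> dict[str, float]:
    hits, reciprocal_ranks = 0, []
    for question, ideal_passage in eval_set:
        results = retrieve_context(question, k=k)
        # A result counts as relevant if it contains the ideal passage.
        ranks = [i for i, r in enumerate(results, start=1) if ideal_passage in r]
        if ranks:
            hits += 1
            reciprocal_ranks.append(1.0 / ranks[0])
        else:
            reciprocal_ranks.append(0.0)
    return {
        "recall@k": hits / len(eval_set),
        "mrr": sum(reciprocal_ranks) / len(eval_set),
    }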

Closing

Chunking isn’t a hyperparameter; it’s a design decision. The strategy that fits your data shape moves retrieval quality more than swapping embedding models, more than re-ranking, more than prompt iteration. For technical content: structural primary, recursive fallback, parent-child for context. For unstructured prose: recursive splitter is fine. Measure with a small eval set. Tune the system, not one knob in isolation.