post · 2026-04-29 · 6 min read
Why your top-K vector search returns 8 markers stacked on Manhattan
The first time I plotted similarity-search results on a map I learned an annoying truth about cosine top-K. The query was “places like Times Square in the United States”. The result was 15 markers, 12 of which were stacked on top of each other across six city blocks of Midtown Manhattan.
The query is correct. The cosine math is correct. The result is useless.
This post is about why that happens, what to do about it, and a second fix on top of the first that production systems actually need.
See it for yourself
The widget below is the “places like Times Square” query with the same three knobs the production system exposes. Slide them and watch the result list (and the map pins) change in real time. There is no backend and the scores are hand-baked, but the behaviour matches what you would see if you wired this up against a real pgvector index.
top 15 (29 eligible):
1. Hell's Kitchen, NYC (0.97)
2. Midtown East, NYC (0.96)
3. Greenwich Village, NYC (0.95)
4. Lower Manhattan, NYC (0.94)
5. Union Square, NYC (0.94)
6. Upper West Side, NYC (0.92)
7. Williamsburg, Brooklyn (0.90)
8. DC Downtown (0.86)
9. Boston Back Bay (0.85)
10. Cambridge MA (0.84)
11. Philly Center City (0.83)
12. SF Financial District (0.83)
13. Chicago Loop (0.82)
14. Miami Brickell (0.81)
15. Atlanta Midtown (0.80)
Try this:
- Leave diversity at 0 and pull K up to 20. You will see the top of the list saturate with NYC-borough cells before it ever crosses the Hudson. That is the failure mode.
- Now slide diversity to ~0.5. The list rebalances: the top spots stay strong matches but the rest of the list spreads across DC, Boston, Chicago, SF.
- Crank min score to 0.75. Suddenly only dense urban cores survive. Smaller cities drop out, and the top of the list is unchanged.
That sense, that the right defaults make the difference between useful and useless output, is what the rest of this post is about.
What top-K optimises for
Cosine top-K is doing exactly one thing: returning the K cells whose embedding vectors are closest to the query vector. There is nothing in that objective that says “and please make sure the K cells are not all next-door neighbours”. So when the input is “Times Square” and the embedding has learned that “dense urban core with high transit and 5000+ POIs” is a tight cluster, you get back the K cells that match that cluster best. Those cells happen to be physically adjacent because Times Square is surrounded by other Times-Square-like cells.
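For reference, the whole objective fits in a few lines. This is a minimal numpy sketch of vanilla cosine top-K over an in-memory matrix of cell embeddings (names are illustrative, not the production code):

```python
import numpy as np

def cosine_top_k(query_vec, cell_vecs, k):
    # Normalise everything, score each cell against the query, keep the K best.
    q = query_vec / np.linalg.norm(query_vec)
    c = cell_vecs / np.linalg.norm(cell_vecs, axis=1, keepdims=True)
    scores = c @ q                    # cosine similarity, one score per cell
    return np.argsort(-scores)[:k]    # indices of the K most similar cells
```

Nothing in those lines knows or cares where the cells sit on a map.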
Cosine top-K is a similarity ranker, not a recommender. A recommender’s job is to surface a set of useful options. A ranker’s job is to rank by similarity. We need to layer something on top.
Fix 1: MMR (Maximal Marginal Relevance)
The classic fix is MMR, from a 1998 paper by Carbonell and Goldstein. The idea is one line:
```
score_i = lambda * similarity(query, candidate_i)
        - (1 - lambda) * max_j similarity(candidate_i, selected_j)
```

For each new pick, balance two things:
- How similar this candidate is to the query (we still want relevant).
- How similar this candidate is to anything we have already picked (penalise redundancy).
The lambda knob lets the caller decide where on that spectrum they want to be. lambda = 1 recovers vanilla cosine top-K. lambda = 0 is pure spread. Most useful values sit around 0.5.
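To make that concrete with illustrative numbers: at lambda = 0.5, a second Midtown cell with 0.96 similarity to the query but 0.98 similarity to the Midtown cell already picked scores 0.5 * 0.96 - 0.5 * 0.98 = -0.01, while a Chicago cell at 0.82 query similarity and 0.60 redundancy scores 0.5 * 0.82 - 0.5 * 0.60 = 0.11. The farther, less redundant candidate wins the slot.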
In code (rough sketch):
```python
import numpy as np

def cosine(a, b):
    # Plain cosine similarity; swap in whatever helper you already have.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def mmr(query_vec, candidate_vecs, candidate_ids, k, lam=0.5):
    selected = []
    selected_vecs = []
    remaining = list(range(len(candidate_ids)))
    while len(selected) < k and remaining:
        best_score = -float('inf')
        best_i = None
        for i in remaining:
            sim_to_query = cosine(query_vec, candidate_vecs[i])
            # Redundancy = similarity to the closest already-selected result.
            redundancy = max(
                (cosine(candidate_vecs[i], v) for v in selected_vecs),
                default=0.0,
            )
            score = lam * sim_to_query - (1 - lam) * redundancy
            if score > best_score:
                best_score = score
                best_i = i
        selected.append(candidate_ids[best_i])
        selected_vecs.append(candidate_vecs[best_i])
        remaining.remove(best_i)
    return selected
```

The greedy loop is O(K * N) candidate evaluations for K results from N candidates (each evaluation recomputes a max over up to K already-selected vectors, so the cosine count is closer to O(K^2 * N)). To keep it cheap at query time:
- Do not run MMR over all 2.5 million cells in the database. Pull a top M (say M = 4 * K) by raw cosine first.
- Run MMR on the M candidates only.
- You need the candidate vectors for the redundancy calculation, so the SQL has to include the embedding column, not just the row metadata.
The pgvector flow looks like this:
```sql
SELECT cell_id, lat, lon, embedding,
       1 - (embedding <=> $1::vector) AS score
FROM cells
WHERE is_active
ORDER BY embedding <=> $1::vector
LIMIT $2;  -- $2 = 4 * K, the overfetch
```

Then MMR re-ranks the result list in Python.
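To make the hand-off concrete, here is a rough sketch of the Python side. `fetch_candidates` and the `Candidate` record are stand-ins (not the production code) for however you execute the SQL above and decode the embedding column; `mmr` is the function from earlier:

```python
from typing import List, NamedTuple
import numpy as np

class Candidate(NamedTuple):
    # One overfetched row, mirroring the columns selected above.
    cell_id: str
    lat: float
    lon: float
    embedding: np.ndarray
    score: float

def search_with_mmr(query_vec, k=15, lam=0.5):
    # fetch_candidates() is a placeholder for running the SQL overfetch
    # (psycopg, asyncpg, an ORM) and decoding embeddings into numpy arrays.
    candidates: List[Candidate] = fetch_candidates(query_vec, limit=4 * k)
    return mmr(
        np.asarray(query_vec, dtype=float),
        [c.embedding for c in candidates],
        [c.cell_id for c in candidates],
        k=k,
        lam=lam,
    )
```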
What MMR doesn’t fix
Run “Times Square” with lambda = 0.5 and the result improves a lot. You start seeing Boston, DC, Chicago, SF appear instead of just Midtown blocks. But you can still get two SF cells in the top 15 if they sit 2 km apart on the map yet are distinct enough in embedding space that MMR never treats them as redundant. Visually, it still looks like a cluster on the map.
MMR optimises for diversity in embedding space. The map does not care about embedding space. The map cares about kilometres.
Fix 2: Geographic deduplication
This one is dumber and more effective. After all the vector ranking, do a final pass that walks the list in score order and drops anything within min_separation_km of an already-kept cell:
```python
def dedup_geographic(results, min_km, k):
    kept = []
    for r in results:
        # Drop anything closer than min_km to a result we are already keeping.
        too_close = any(
            haversine_km(r.lat, r.lon, kr.lat, kr.lon) < min_km
            for kr in kept
        )
        if not too_close:
            kept.append(r)
        if len(kept) >= k:
            break
    return kept
```

Pair this with the MMR overfetch from before, set min_separation_km to something like 25 km for “places like Paris in USA”, and the result list goes from “15 markers stacked on Manhattan” to “NYC, Miami, DC, LA, SF, Boston, Atlanta, San Diego, Fort Lauderdale, Raleigh” in 10 lines.
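The dedup leans on a `haversine_km` helper. If you do not already have one, the standard great-circle formula is a few lines (a self-contained sketch, no geo library required):

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    # Great-circle distance between two (lat, lon) points, in kilometres.
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))
```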
The cost is one extra haversine per kept candidate per remaining candidate. For typical K=15 and overfetch=60, that is at most 900 cheap distance computations. Negligible compared to the cosine top-K it sits on top of.
Why both, not just one
It is tempting to skip MMR and use only the geographic dedup. After all, dedup-by-distance feels stronger. But pure dedup leaves a hole: if your top-50 candidates are all in one country because the embedding has a strong country bias, geographic dedup spreads them out within that country. You still get 15 different US cities and zero European cities, even though MIT and Heidelberg are structurally very similar.
MMR pulls across the embedding space. Geographic dedup pulls across the map. They solve different failure modes. You want both.
Defaults that don’t make the user think
Here is where I landed for production defaults:
- lambda (MMR diversity): 0 by default. Most queries do not need it. Expose a slider for the user to crank up when they want.
- min_separation_km: 10 by default. Always on. Catches the worst clustering with no user awareness needed.
- Overfetch: 4 * K, capped at 200. Anything more is wasted cosine.
A naive query path looks like “call vector DB, return top 15”. The actual production query path looks like:
1. Filter by region / attribute / proximity
2. Cosine top-K with overfetch
3. MMR rerank if diversity > 0
4. Geographic dedup if min_separation_km > 0
5. Hydrate metadata, return

Five steps, but each is independently understandable, and the failure modes of each are well-defined.
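Wired together, the whole path is a short function. This is a sketch under the same assumptions as before: fetch_candidates stands in for steps 1 and 2, metadata hydration is left out, and mapping the diversity knob onto lambda as 1 - diversity is my reading of the slider, not a documented contract:

```python
def query_cells(query_vec, k=15, diversity=0.0, min_separation_km=10.0):
    # Steps 1 + 2: filters live in the SQL; overfetch 4 * K, capped at 200.
    candidates = fetch_candidates(query_vec, limit=min(4 * k, 200))

    # Step 3: MMR rerank only when the caller asked for diversity.
    if diversity > 0:
        order = mmr(
            query_vec,
            [c.embedding for c in candidates],
            [c.cell_id for c in candidates],
            k=len(candidates),
            lam=1 - diversity,
        )
        by_id = {c.cell_id: c for c in candidates}
        candidates = [by_id[cid] for cid in order]

    # Step 4: geographic dedup, on by default.
    if min_separation_km > 0:
        candidates = dedup_geographic(candidates, min_km=min_separation_km, k=k)

    # Step 5: hydrate metadata and return (left out of the sketch).
    return candidates[:k]
```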
The lesson
Production similarity search is not about getting the cosine math right. The cosine math is given. Production similarity search is about layering filters and rerankers on top of cosine until the result feels right when you put it on a map (or in a list, or in front of a user). Vector DBs ship the easy part. The interesting work is everything that wraps it.
If your top-K results look great in a Jupyter notebook but bad on a map, you probably have not added the geographic deduplication step yet.