post · 2026-04-29 · 6 min read
Why your top-K vector search returns 8 markers stacked on Manhattan
The first time I plotted similarity-search results on a map I learned an annoying truth about cosine top-K. The query was “places like Times Square in the United States”. The result was 15 markers, 12 of which were stacked on top of each other across six city blocks of Midtown Manhattan.
The query is correct. The cosine math is correct. The result is useless.
This post is about why that happens, what to do about it, and a second fix on top of the first that production systems actually need.
See it for yourself
The widget below is the “places like Times Square” query with the same three knobs the production system exposes. Slide them and watch the result list (and the map pins) change in real time. There is no backend and the scores are hand-baked, but the behaviour matches what you would see if you wired this up against a real pgvector index.
top 15 (29 eligible):
1. Hell's Kitchen, NYC (0.97)
2. Midtown East, NYC (0.96)
3. Greenwich Village, NYC (0.95)
4. Lower Manhattan, NYC (0.94)
5. Union Square, NYC (0.94)
6. Upper West Side, NYC (0.92)
7. Williamsburg, Brooklyn (0.90)
8. DC Downtown (0.86)
9. Boston Back Bay (0.85)
10. Cambridge MA (0.84)
11. Philly Center City (0.83)
12. SF Financial District (0.83)
13. Chicago Loop (0.82)
14. Miami Brickell (0.81)
15. Atlanta Midtown (0.80)
Try this:
- Leave diversity at 0 and pull K up to 20. You will see the top of the list saturate with NYC-borough cells before it ever crosses the Hudson. That is the failure mode.
- Now slide diversity to ~0.5. The list rebalances: the top spots stay strong matches but the rest of the list spreads across DC, Boston, Chicago, SF.
- Crank min score to 0.75. Suddenly only dense urban cores survive. Smaller cities drop out, and the top of the list is unchanged.
That sense, that the right defaults make the difference between useful and useless output, is what the rest of this post is about.
What top-K optimises for
Cosine top-K is doing exactly one thing: returning the K cells whose embedding vectors are closest to the query vector. There is nothing in that objective that says “and please make sure the K cells are not all next-door neighbours”. So when the input is “Times Square” and the embedding has learned that “dense urban core with high transit and 5000+ POIs” is a tight cluster, you get back the K cells that match that cluster best. Those cells happen to be physically adjacent because Times Square is surrounded by other Times-Square-like cells.
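For reference, the whole objective fits in a few lines. This is a minimal numpy sketch of vanilla cosine top-K over an in-memory matrix of cell embeddings (names are illustrative, not the production code):

```python
import numpy as np

def cosine_top_k(query_vec, cell_vecs, k):
    # Normalise everything, score each cell against the query, keep the K best.
    q = query_vec / np.linalg.norm(query_vec)
    c = cell_vecs / np.linalg.norm(cell_vecs, axis=1, keepdims=True)
    scores = c @ q                    # cosine similarity, one score per cell
    return np.argsort(-scores)[:k]    # indices of the K most similar cells
```

Nothing in those lines knows or cares where the cells sit on a map.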
Cosine top-K is a similarity ranker, not a recommender. A recommender’s job is to surface a set of useful options. A ranker’s job is to rank by similarity. We need to layer something on top.
Fix 1: MMR (Maximal Marginal Relevance)
The classic fix is MMR, from a 1998 paper by Carbonell and Goldstein. The idea is one line:
```
score_i = lambda * similarity(query, candidate_i)
        - (1 - lambda) * max_j similarity(candidate_i, selected_j)
```

For each new pick, balance two things:
- How similar this candidate is to the query (we still want relevant).
- How similar this candidate is to anything we have already picked (penalise redundancy).
The lambda knob lets the caller decide where on that spectrum they want to be. lambda = 1 recovers vanilla cosine top-K. lambda = 0 is pure spread. Most useful values sit around 0.5.
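To make that concrete with illustrative numbers: at lambda = 0.5, a second Midtown cell with 0.96 similarity to the query but 0.98 similarity to the Midtown cell already picked scores 0.5 * 0.96 - 0.5 * 0.98 = -0.01, while a Chicago cell at 0.82 query similarity and 0.60 redundancy scores 0.5 * 0.82 - 0.5 * 0.60 = 0.11. The farther, less redundant candidate wins the slot.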
In code (rough sketch):
```python
import numpy as np

def cosine(a, b):
    # Plain cosine similarity; swap in whatever helper you already have.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def mmr(query_vec, candidate_vecs, candidate_ids, k, lam=0.5):
    selected = []
    selected_vecs = []
    remaining = list(range(len(candidate_ids)))
    while len(selected) < k and remaining:
        best_score = -float('inf')
        best_i = None
        for i in remaining:
            sim_to_query = cosine(query_vec, candidate_vecs[i])
            # Redundancy = similarity to the closest already-selected result.
            redundancy = max(
                (cosine(candidate_vecs[i], v) for v in selected_vecs),
                default=0.0,
            )
            score = lam * sim_to_query - (1 - lam) * redundancy
            if score > best_score:
                best_score = score
                best_i = i
        selected.append(candidate_ids[best_i])
        selected_vecs.append(candidate_vecs[best_i])
        remaining.remove(best_i)
    return selected
```

The greedy loop is O(K * N) candidate evaluations for K results from N candidates (each evaluation recomputes a max over up to K already-selected vectors, so the cosine count is closer to O(K^2 * N)). To keep it cheap at query time:
- Do not run MMR over all 2.5 million cells in the database. Pull a top M (say M = 4 * K) by raw cosine first.
- Run MMR on the M candidates only.
- You need the candidate vectors for the redundancy calculation, so the SQL has to include the embedding column, not just the row metadata.
The pgvector flow looks like this:
```sql
SELECT cell_id, lat, lon, embedding,
       1 - (embedding <=> $1::vector) AS score
FROM cells
WHERE is_active
ORDER BY embedding <=> $1::vector
LIMIT $2;  -- $2 = 4 * K, the overfetch
```

Then MMR re-ranks the result list in Python.
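To make the hand-off concrete, here is a rough sketch of the Python side. `fetch_candidates` and the `Candidate` record are stand-ins (not the production code) for however you execute the SQL above and decode the embedding column; `mmr` is the function from earlier:

```python
from typing import List, NamedTuple
import numpy as np

class Candidate(NamedTuple):
    # One overfetched row, mirroring the columns selected above.
    cell_id: str
    lat: float
    lon: float
    embedding: np.ndarray
    score: float

def search_with_mmr(query_vec, k=15, lam=0.5):
    # fetch_candidates() is a placeholder for running the SQL overfetch
    # (psycopg, asyncpg, an ORM) and decoding embeddings into numpy arrays.
    candidates: List[Candidate] = fetch_candidates(query_vec, limit=4 * k)
    return mmr(
        np.asarray(query_vec, dtype=float),
        [c.embedding for c in candidates],
        [c.cell_id for c in candidates],
        k=k,
        lam=lam,
    )
```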
What MMR doesn’t fix
Run “Times Square” with lambda = 0.5 and the result improves a lot. You start seeing Boston, DC, Chicago, SF appear instead of just Midtown blocks. But you can still get two SF cells in the top 15 if they sit 2 km apart on the map yet are distinct enough in embedding space that MMR never treats them as redundant. Visually, it still looks like a cluster on the map.
MMR optimises for diversity in embedding space. The map does not care about embedding space. The map cares about kilometres.
Fix 2: Geographic deduplication
This one is dumber and more effective. After all the vector ranking, do a final pass that walks the list in score order and drops anything within min_separation_km of an already-kept cell:
```python
def dedup_geographic(results, min_km, k):
    kept = []
    for r in results:
        # Drop anything closer than min_km to a result we are already keeping.
        too_close = any(
            haversine_km(r.lat, r.lon, kr.lat, kr.lon) < min_km
            for kr in kept
        )
        if not too_close:
            kept.append(r)
        if len(kept) >= k:
            break
    return kept
```

Pair this with the MMR overfetch from before, set min_separation_km to something like 25 km for “places like Paris in USA”, and the result list goes from “15 markers stacked on Manhattan” to “NYC, Miami, DC, LA, SF, Boston, Atlanta, San Diego, Fort Lauderdale, Raleigh” in 10 lines.
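The dedup leans on a `haversine_km` helper. If you do not already have one, the standard great-circle formula is a few lines (a self-contained sketch, no geo library required):

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    # Great-circle distance between two (lat, lon) points, in kilometres.
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))
```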
The cost is one extra haversine per kept candidate per remaining candidate. For typical K=15 and overfetch=60, that is at most 900 cheap distance computations. Negligible compared to the cosine top-K it sits on top of.
Why both, not just one
It is tempting to skip MMR and use only the geographic dedup. After all, dedup-by-distance feels stronger. But pure dedup leaves a hole: if your top-50 candidates are all in one country because the embedding has a strong country bias, geographic dedup spreads them out within that country. You still get 15 different US cities and zero European cities, even though MIT and Heidelberg are structurally very similar.
MMR pulls across the embedding space. Geographic dedup pulls across the map. They solve different failure modes. You want both.
Defaults that don’t make the user think
Here is where I landed for production defaults:
- lambda (MMR diversity): 0 by default. Most queries do not need it. Expose a slider for the user to crank up when they want.
- min_separation_km: 10 by default. Always on. Catches the worst clustering with no user awareness needed.
- Overfetch: 4 * K, capped at 200. Anything more is wasted cosine.
A naive query path looks like “call vector DB, return top 15”. The actual production query path looks like:
1. Filter by region / attribute / proximity
2. Cosine top-K with overfetch
3. MMR rerank if diversity > 0
4. Geographic dedup if min_separation_km > 0
5. Hydrate metadata, return

Five steps, but each is independently understandable, and the failure modes of each are well-defined.
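Wired together, the whole path is a short function. This is a sketch under the same assumptions as before: fetch_candidates stands in for steps 1 and 2, metadata hydration is left out, and mapping the diversity knob onto lambda as 1 - diversity is my reading of the slider, not a documented contract:

```python
def query_cells(query_vec, k=15, diversity=0.0, min_separation_km=10.0):
    # Steps 1 + 2: filters live in the SQL; overfetch 4 * K, capped at 200.
    candidates = fetch_candidates(query_vec, limit=min(4 * k, 200))

    # Step 3: MMR rerank only when the caller asked for diversity.
    if diversity > 0:
        order = mmr(
            query_vec,
            [c.embedding for c in candidates],
            [c.cell_id for c in candidates],
            k=len(candidates),
            lam=1 - diversity,
        )
        by_id = {c.cell_id: c for c in candidates}
        candidates = [by_id[cid] for cid in order]

    # Step 4: geographic dedup, on by default.
    if min_separation_km > 0:
        candidates = dedup_geographic(candidates, min_km=min_separation_km, k=k)

    # Step 5: hydrate metadata and return (left out of the sketch).
    return candidates[:k]
```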
The lesson
Production similarity search is not about getting the cosine math right. The cosine math is given. Production similarity search is about layering filters and rerankers on top of cosine until the result feels right when you put it on a map (or in a list, or in front of a user). Vector DBs ship the easy part. The interesting work is everything that wraps it.
If your top-K results look great in a Jupyter notebook but bad on a map, you probably have not added the geographic deduplication step yet.