project · 2026

Multi-country geospatial similarity engine

A geospatial similarity engine that learns the structural fingerprint of a place from OpenStreetMap features and answers 'what other places are like this?' across multiple countries. Custom-trained vision-transformer embeddings, pgvector, MCP server, agent-driven map UI. Built end-to-end.

A geospatial similarity engine that answers cross-country questions like “places like Heidelberg in the US” or “university towns anywhere on the continent”. Built end-to-end: model training, infrastructure, query layer, and the agent-facing UI.

Try the query knobs

The widget below mirrors the production query layer. Hardcoded scores, no live backend, but the three knobs (K, MMR diversity, min cosine) drive the result list and map pins exactly the way the real system does. Drag any of them.

interactive demo · places like Times Square in the USA

top 15 (29 eligible)

  1. Hell's Kitchen, NYC · 0.97
  2. Midtown East, NYC · 0.96
  3. Greenwich Village, NYC · 0.95
  4. Lower Manhattan, NYC · 0.94
  5. Union Square, NYC · 0.94
  6. Upper West Side, NYC · 0.92
  7. Williamsburg, Brooklyn · 0.90
  8. DC Downtown · 0.86
  9. Boston Back Bay · 0.85
  10. Cambridge MA · 0.84
  11. Philly Center City · 0.83
  12. SF Financial District · 0.83
  13. Chicago Loop · 0.82
  14. Miami Brickell · 0.81
  15. Atlanta Midtown · 0.80
scripted demo, hardcoded scores per query. MMR redundancy uses geographic distance as a proxy; the production system uses embedding distance directly. Same intuition.
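For concreteness, here is a minimal sketch of the greedy MMR reranking the demo performs, using geographic distance as the redundancy proxy exactly as the note above describes. The function name, the tuple layout, and the `1 / (1 + km)` redundancy mapping are illustrative choices, not the production code:

```python
import math

def haversine_km(a, b):
    """Great-circle distance between two (lat, lon) points in km."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * math.asin(math.sqrt(h))

def mmr_rerank(candidates, k, lam=0.7, min_cosine=0.0):
    """Greedy MMR: trade anchor similarity against redundancy with
    already-selected results. Redundancy here is a geographic proxy
    (closer to a picked result = more redundant), as in the demo;
    the production system uses embedding distance instead.

    candidates: list of (name, cosine_score, (lat, lon)) tuples.
    lam: the MMR knob; 1.0 degenerates to plain top-K cosine.
    """
    pool = [c for c in candidates if c[1] >= min_cosine]  # min-cosine knob
    selected = []
    while pool and len(selected) < k:                      # K knob
        def mmr_score(c):
            if not selected:
                return c[1]
            # redundancy in (0, 1]: approaches 1 on top of a picked result
            nearest = min(haversine_km(c[2], s[2]) for s in selected)
            redundancy = 1.0 / (1.0 + nearest)
            return lam * c[1] - (1 - lam) * redundancy     # diversity knob
        best = max(pool, key=mmr_score)
        selected.append(best)
        pool.remove(best)
    return selected
```

With two stacked Manhattan candidates and one distant Boston candidate, `lam=1.0` returns the two Manhattan results (plain top-K), while a lower `lam` swaps the second Manhattan result for Boston, which is the stacked-markers failure mode the knob exists to fix.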

Why this is non-trivial

The naive answer is “embed every cell, do cosine top-K”. That fails three ways at scale:

  1. Cross-country structural similarity is not the same as “near each other”: Heidelberg and Cambridge MA share a structural fingerprint (dense walkable university core) that no geographic feature alone captures. The embedding has to learn the shared pattern despite different POI distributions, building styles, and feature densities across countries.
  2. Single-country training overfits. A US-only model treats European cells as out-of-distribution. The fix is to train jointly across countries with deliberate balancing so the smaller datasets don’t get drowned out.
  3. Top-K cosine is not enough: naive results cluster on the anchor’s own neighbourhood. You get 10 markers stacked on Manhattan instead of “10 distinct cities”. Production-quality similarity needs MMR diversity reranking, geographic deduplication, and settlement-density filtering layered on top.
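The layering in point 3 starts with a single over-fetching pgvector query whose output feeds the reranker and dedup stages in application code. A sketch, assuming a psycopg-style connection; the table and column names (`cells`, `settlement_density`, `embedding`) and the default thresholds are illustrative, not the production schema:

```python
# pgvector's `<=>` operator is cosine distance, so 1 - (a <=> b) is
# cosine similarity. Over-fetch ~5x the final K so MMR reranking and
# geographic dedup still have enough distinct candidates to choose from.
OVERFETCH = 5

SQL = """
SELECT id, name, country,
       1 - (embedding <=> %(anchor)s) AS cosine_sim
FROM cells
WHERE settlement_density >= %(min_density)s
  AND 1 - (embedding <=> %(anchor)s) >= %(min_cosine)s
ORDER BY embedding <=> %(anchor)s
LIMIT %(limit)s;
"""

def fetch_candidates(conn, anchor_vec, k, min_cosine=0.6, min_density=0.2):
    """Candidate pool for one similarity query; MMR and dedup run on
    the returned rows. Assumes the driver adapts anchor_vec to the
    vector type (e.g. pgvector-python's register_vector)."""
    with conn.cursor() as cur:
        cur.execute(SQL, {"anchor": anchor_vec, "min_cosine": min_cosine,
                          "min_density": min_density, "limit": k * OVERFETCH})
        return cur.fetchall()
```

Filtering on settlement density inside the query keeps empty rural cells from ever entering the candidate pool, which is cheaper than discarding them after the fact.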

My role

End-to-end. Designed the architecture, trained the embedding, stood up the infrastructure, built the query API, and contributed the agent integration on the frontend. ~1 person-quarter of work.

Architecture

What “production-quality” meant in practice

Why this earns a spot in projects

This wasn’t a notebook experiment that got shelved. It runs in a private cluster, behind an MCP server that an agent calls, with a frontend that demos cleanly. The architecture is reusable: the same pattern of “custom embeddings → pgvector → MCP → agentic UI” now applies to any TomTom data set we want to make searchable by similarity.
