project · 2026
Multi-country geospatial similarity engine
A geospatial similarity engine that learns the structural fingerprint of a place from OpenStreetMap features and answers cross-country questions like “places like Heidelberg in the US” or “university towns anywhere on the continent”. Custom-trained vision-transformer embeddings, pgvector, an MCP server, and an agent-driven map UI. Built end-to-end: model training, infrastructure, query layer, and the agent-facing UI.
Try the query knobs
The widget below mirrors the production query layer. The scores are hardcoded (no live backend), but the three knobs (K, MMR diversity, min cosine) drive the result list and map pins exactly the way the real system does. Drag any of them.
top 15 (29 eligible)
1. Hell's Kitchen, NYC · 0.97
2. Midtown East, NYC · 0.96
3. Greenwich Village, NYC · 0.95
4. Lower Manhattan, NYC · 0.94
5. Union Square, NYC · 0.94
6. Upper West Side, NYC · 0.92
7. Williamsburg, Brooklyn · 0.90
8. DC Downtown · 0.86
9. Boston Back Bay · 0.85
10. Cambridge MA · 0.84
11. Philly Center City · 0.83
12. SF Financial District · 0.83
13. Chicago Loop · 0.82
14. Miami Brickell · 0.81
15. Atlanta Midtown · 0.80
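The three knobs compose into a small rerank pipeline: a cosine floor, a greedy MMR pass, and a cutoff at K. A minimal sketch of that pipeline (the `query` function and its exact weighting are illustrative, not the production code):

```python
import numpy as np

def query(scores, vectors, k=15, diversity=0.3, min_cosine=0.80):
    """Mimic the widget's three knobs: min-cosine floor, MMR rerank, top-K.

    scores:  (N,) cosine similarity of each candidate to the anchor.
    vectors: (N, d) candidate embeddings, unit-normalised.
    """
    # Knob 3: drop candidates below the cosine floor.
    eligible = np.flatnonzero(scores >= min_cosine)
    selected = []
    # Knobs 1+2: greedy MMR trades anchor relevance against
    # redundancy with what is already selected.
    while eligible.size and len(selected) < k:
        if not selected:
            mmr = scores[eligible]
        else:
            redundancy = (vectors[eligible] @ vectors[selected].T).max(axis=1)
            mmr = (1 - diversity) * scores[eligible] - diversity * redundancy
        pick = eligible[np.argmax(mmr)]
        selected.append(pick)
        eligible = eligible[eligible != pick]
    return selected
```

At `diversity=0` this degenerates to plain cosine top-K; turning the knob up penalises candidates that look like something already on the list, which is what spreads the pins beyond Manhattan.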
Why this is non-trivial
The naive answer is “embed every cell, do cosine top-K”. That fails three ways at scale:
- Cross-country structural similarity is not the same as “near each other”: Heidelberg and Cambridge MA share a structural fingerprint (a dense, walkable university core) that no geographic feature alone captures. The embedding has to learn the shared pattern despite different POI distributions, building styles, and feature densities across countries.
- Single-country training overfits. A US-only model treats European cells as out-of-distribution. The fix is to train jointly across countries with deliberate balancing so the smaller datasets don’t get drowned out.
- Top-K cosine is not enough: naive results cluster on the anchor’s own neighbourhood, so you get 10 markers stacked on Manhattan instead of 10 distinct cities. Production-quality similarity needs MMR diversity reranking, geographic deduplication, and settlement-density filtering layered on top.
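The geographic-deduplication part can be sketched as a single greedy pass over the ranked list: keep a hit only if it is far enough from everything already kept. A minimal illustration (the greedy strategy and the `dedup_by_separation` helper are assumptions, not the production implementation):

```python
import math

def haversine_km(a, b):
    """Great-circle distance between two (lat, lon) points in km."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))

def dedup_by_separation(ranked, min_separation_km=25.0):
    """Greedy geographic dedup: walk the list in rank order and keep a hit
    only if it is at least min_separation_km from every kept hit so far."""
    kept = []
    for name, score, latlon in ranked:
        if all(haversine_km(latlon, k[2]) >= min_separation_km for k in kept):
            kept.append((name, score, latlon))
    return kept
```

With a 25 km floor, the stack of Manhattan neighbourhoods collapses to its top-ranked member while distant cities survive.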
My role
End-to-end: designed the architecture, trained the embedding, stood up the infrastructure, built the query API, and contributed the agent integration on the frontend. ~1 person-quarter of work.
Architecture
- Encoder: Masked-Autoencoder Vision Transformer trained on per-cell OSM-derived feature grids (POI categories, road density, transit, water, terrain). Country-balanced sampling at training time so smaller countries get represented fairly.
- Index: PostgreSQL with pgvector, IVFFlat for top-K, and partial indexes filtered on `is_active` so unpopulated cells never enter the candidate pool.
- Pre-computation: a distance-to-nearest column per proximity feature (river, transit, park, peak, etc.), computed once at load via KD-tree. Lets queries like “places near a river” run as a cheap indexed lookup instead of an O(N) scan.
- Settlement density: at load time, count of active neighbours within 3 km is stored per cell. Drops isolated outliers (lone gas stations, highway exits) from results.
- Query layer: FastMCP server exposing `anchor_similarity`, `concept_similarity`, and `contrastive_similarity`, plus raw-feature lookup and health. Each tool accepts `diversity` (MMR), `min_separation_km` (geographic dedup), and a layered region filter.
- Frontend: Maps SDK + Agent Toolkit. The LLM picks exemplars, tunes diversity, and scopes by region based on a natural-language query.
- Infra: AKS with workload identity, Postgres Flexible Server with pgvector, Azure Container Registry behind a private endpoint, GitHub Actions CI/CD with OIDC federation.
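The two load-time columns (distance-to-nearest and settlement density) both fall out of one KD-tree pass over cell centroids. A minimal sketch with SciPy, assuming projected metric coordinates and a hypothetical `precompute_columns` helper:

```python
import numpy as np
from scipy.spatial import cKDTree

def precompute_columns(cell_xy, river_xy, radius_m=3000.0):
    """Load-time pre-computation sketch.

    cell_xy:  (N, 2) active-cell centroids in metres.
    river_xy: (M, 2) river nodes; one such array per proximity feature.
    Returns a distance-to-nearest-river column and a settlement-density
    column (count of other active cells within radius_m).
    """
    # Nearest-neighbour distance per cell: O(N log M), computed once at load.
    dist_to_river, _ = cKDTree(river_xy).query(cell_xy)
    # Settlement density: neighbours within the radius, minus the cell itself.
    cell_tree = cKDTree(cell_xy)
    neighbours = cell_tree.query_ball_point(cell_xy, r=radius_m)
    settlement_density = np.array([len(n) - 1 for n in neighbours])
    return dist_to_river, settlement_density
```

At query time both values are plain indexed columns, so “near a river” or “not an isolated highway exit” never touches geometry again.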
What “production-quality” meant in practice
- Validated cross-country generalisation quantitatively before deploying: POI-density regression R² > 0.9 on held-out cells in every country, not just the dominant one.
- Per-country breakdown of validation metrics so we’d catch the failure mode “USA looks great, Europe is garbage” early. It would have been easy to ship without this and pretend it worked.
- Default-on quality knobs at query time (geographic min-separation, settlement-density floor) so users don’t have to know about them to get good results.
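The per-country breakdown is just the pooled validation metric grouped by country before averaging. A sketch of that check (the `r2_by_country` helper is illustrative; any metrics library would do the same):

```python
import numpy as np

def r2_by_country(y_true, y_pred, country):
    """Per-country R² so 'USA looks great, Europe is garbage' shows up as a
    number instead of hiding inside one pooled metric."""
    y_true, y_pred, country = map(np.asarray, (y_true, y_pred, country))
    out = {}
    for c in sorted(set(country)):
        m = country == c
        yt, yp = y_true[m], y_pred[m]
        ss_res = ((yt - yp) ** 2).sum()          # residual sum of squares
        ss_tot = ((yt - yt.mean()) ** 2).sum()   # variance around country mean
        out[c] = 1.0 - ss_res / ss_tot
    return out
```

Gating deployment on the minimum of these values, rather than the pooled score, is what catches a model that only learned the dominant country.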
Why this earns a spot in projects
This wasn’t a notebook experiment that got shelved. It runs in a private cluster, behind an MCP server that an agent calls, with a frontend that demos cleanly. The architecture is reusable: the same pattern of “custom embeddings → pgvector → MCP → agentic UI” now applies to any TomTom dataset we want to make searchable by similarity.