Cosine top-K is doing the right math but the wrong job. Here is the gotcha, the fix (MMR), and a second fix on top of that (geographic deduplication) that production systems actually need.
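The core MMR loop is small enough to sketch inline. A minimal numpy version (function name and the λ default are illustrative, not from the post):

```python
import numpy as np

def mmr(query_vec, doc_vecs, k=3, lam=0.7):
    """Maximal Marginal Relevance: balance query relevance against redundancy
    with already-selected documents. Returns indices into doc_vecs."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    candidates = list(range(len(doc_vecs)))
    selected = []
    while candidates and len(selected) < k:
        def score(i):
            relevance = cos(query_vec, doc_vecs[i])
            # Penalty: similarity to the closest already-selected doc.
            redundancy = max((cos(doc_vecs[i], doc_vecs[j]) for j in selected),
                             default=0.0)
            return lam * relevance - (1 - lam) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected
```

With `lam=1.0` this degenerates to plain cosine top-K; lowering `lam` trades relevance for diversity, which is what evicts near-duplicate chunks.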
Most agent evals measure the wrong thing. Output quality alone misses 80% of production failure modes. The five eval categories I now run on every agentic system, the LLM-as-judge gotchas, and how to ship a useful eval harness in a week.
Fixed-size chunking is fine for blog posts and dangerous for technical docs. Six chunking strategies, the failure modes I hit with each, and the hybrid approach I now reach for when retrieval quality matters.
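For reference, the naive fixed-size baseline the post argues against fits in three lines (character-based with overlap; parameter defaults are illustrative):

```python
def chunk_fixed(text: str, size: int = 200, overlap: int = 40) -> list[str]:
    """Naive fixed-size chunking with overlap: fine for uniform prose,
    risky for technical docs where it splits code blocks and tables mid-way."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]
```

The overlap is what keeps a sentence that straddles a boundary retrievable from at least one chunk; it does nothing for structure-aware failures.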
Tool naming, description writing, argument schemas, error envelopes, idempotency. Lessons from authoring two production MCP servers, with the patterns I now reach for and the mistakes I've stopped making.
Counter-narrative to 'agents everywhere'. Multi-agent systems pay a real latency, complexity, and observability cost. Here's the framework I use to decide whether to split, and three concrete cases where a single call beats N agents in a graph.

Part 1 of a 2-post series on streaming-data architectures. Lambda architecture solved the right problem in 2014: how to combine batch correctness with streaming's low latency. The cost was running two pipelines for the same logic. Here's why it was right then, and why most teams shouldn't pick it now.
Part 2 of a 2-post series. Kappa architecture is what you get when you ask 'what if we just did everything as a stream and replayed from the event log when we need to reprocess?' One pipeline, no diverged-codebase pain. Here's how it works, where it shines, and where it still bites.
After three years of building and refactoring medallion data lakes, here's the opinionated rule set that holds up: what bronze should and should not do, what makes silver actually queryable, and how to keep gold from drifting into chaos.
From batch to streaming, the gotchas. Watermarks that drop too aggressively, late events that get silently lost, stateful aggregations that grow without bound, and the four operational habits that keep streaming jobs healthy in production.
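The watermark gotcha in miniature: an event is dropped once it falls behind the high-water mark by more than the allowed delay. A plain-Python approximation of that rule (names and the tuple shape are illustrative, not any engine's API):

```python
def with_watermark(events, delay):
    """Partition (event_time, payload) tuples into kept vs. silently dropped,
    using the rule: drop anything older than (max event-time seen) - delay."""
    kept, dropped = [], []
    max_ts = float("-inf")
    for ts, payload in events:
        max_ts = max(max_ts, ts)          # the watermark only advances
        if ts >= max_ts - delay:
            kept.append((ts, payload))
        else:
            dropped.append((ts, payload))  # this is the silent-loss failure mode
    return kept, dropped
```

Note the asymmetry: one early fast-clock event advances the watermark for everyone, which is exactly how an aggressive delay setting starts eating late data.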
Why your Spark job has 200 partitions even when your data is only 5 GB. How to pick a target partition size, when to repartition vs coalesce, when AQE saves you, and the diagnostic loop I run on every slow job now.
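The partition-count arithmetic itself is one line. A sketch using the common ~128 MB rule of thumb (the target size is a widely used default, not a universal constant):

```python
import math

def target_partitions(total_bytes: int, target_mb: int = 128) -> int:
    """Rough partition count: total data size over a target partition size.
    Compare this to Spark's default of 200 shuffle partitions."""
    return max(1, math.ceil(total_bytes / (target_mb * 1024 * 1024)))
```

For the 5 GB example in the teaser this yields 40 partitions, a fifth of the 200 the default hands you.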
Architectural pattern for a multi-tenant data-quality platform across many teams. Why centralising contracts beats centralising data, and how the SDK + control plane + fact/dim store pattern works regardless of which tools you reach for.
Five MERGE patterns that solved real problems for me: idempotent upsert, soft delete, late-arriving data, deduplication on ingest, and the slowly-changing-dimension type-2 case. Plus the performance gotchas that bit me first.
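The semantics of the first pattern, the idempotent upsert, can be modelled in plain Python (a toy stand-in for `MERGE ... WHEN MATCHED THEN UPDATE, WHEN NOT MATCHED THEN INSERT`, not the actual Spark/Delta API):

```python
def merge_upsert(target: dict, updates: list[dict], key: str) -> dict:
    """Toy MERGE: target is keyed rows, updates is a batch of incoming rows.
    Matched keys are overwritten, unmatched keys are inserted."""
    merged = dict(target)
    for row in updates:
        merged[row[key]] = row
    return merged
```

The property that makes it idempotent: replaying the same batch is a no-op, so a retried ingest job can't double-apply.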
How to know your job is skewed (the Spark UI lies more than you think), and the three fix patterns I reach for in production: salt-and-aggregate, broadcast join, and AQE-driven dynamic shuffle.
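Salt-and-aggregate in miniature: fan one hot key out across N salted sub-keys, pre-aggregate, then strip the salt and combine. A single-process sketch of the two stages (the distributed version shuffles between them; names are illustrative):

```python
import random
from collections import defaultdict

def salted_sum(records, n_salts=4):
    """Two-stage sum over (key, value) pairs. Stage 1 spreads a hot key
    across n_salts partial sums; stage 2 merges the partials per key."""
    partial = defaultdict(int)
    for key, value in records:
        partial[(key, random.randrange(n_salts))] += value  # salted sub-key
    final = defaultdict(int)
    for (key, _salt), subtotal in partial.items():
        final[key] += subtotal                               # strip the salt
    return dict(final)
```

The result is identical to a direct groupBy-sum; only the intermediate cardinality changes, which is the whole point for skew.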
Part 1 of a 3-post series. Decorators are easy to misuse and spectacular when they fit. Here are the four shapes I reach for in production AI/data work, with code: retry, timing, instrumentation, and feature-flagging.
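The retry shape, as a minimal sketch with exponential backoff (defaults and the caught-exception tuple are illustrative; production versions usually add jitter and logging):

```python
import functools
import time

def retry(attempts=3, base_delay=0.1, exceptions=(Exception,)):
    """Retry a flaky call, doubling the delay after each failure."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(1, attempts + 1):
                try:
                    return fn(*args, **kwargs)
                except exceptions:
                    if attempt == attempts:
                        raise  # out of attempts: surface the real error
                    time.sleep(base_delay * 2 ** (attempt - 1))
        return wrapper
    return decorator
```

`functools.wraps` is the easy-to-forget line: without it the decorated function loses its name and docstring, which breaks introspection and some frameworks.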
Part 2 of a 3-post series. Context managers earn their keep when you have a resource that must be set up and torn down, and `with open()` is barely scratching the surface. Five patterns: timing, transactions, async, ExitStack, and suppressing exceptions.
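The timing pattern is the smallest of the five; a sketch with `contextlib.contextmanager` (the label/print interface is illustrative, real code would hand this to a logger or metrics client):

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label):
    """Time the enclosed block; reports in a finally so it fires even
    when the block raises, and the exception still propagates."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        print(f"{label}: {elapsed:.3f}s")
```

The `try/finally` around the `yield` is the load-bearing part: it's what makes teardown unconditional.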
Final part of the series. Generators turn an O(N) memory problem into an O(1) memory problem and let you compose pipelines that read like sentences. Five patterns: streaming files, paginated APIs, itertools chains, batching, and async generators.
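The batching pattern as a generator, sketched with `itertools.islice` (Python 3.12 ships `itertools.batched` with essentially these semantics; this hand-rolled version yields lists):

```python
from itertools import islice

def batched(iterable, size):
    """Yield successive lists of up to `size` items. Memory stays O(size)
    regardless of how long the input iterable is."""
    it = iter(iterable)
    while batch := list(islice(it, size)):
        yield batch
```

Because it consumes the source lazily, this composes with the other patterns: you can batch a streaming file reader or a paginated-API generator without ever materialising the whole input.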
Three Python data libraries, three personalities. pandas is the API everyone knows. Polars is faster and stricter. DuckDB is the database masquerading as a library. Here's the decision framework I use, with the cases where each is genuinely the right pick.
Three tools that replaced six in my Python workflow. uv subsumes pip + venv + pip-tools + build + twine. ruff subsumes flake8 + black + isort + pyupgrade. pyright (or basedpyright) makes type checking instant. Here's how to migrate.
Pydantic v2 isn't just dataclasses with validation; it's a contract layer for any boundary where untrusted data crosses into your code: HTTP, JSON files, LLM tool outputs, queue messages. Here are the patterns I reach for.
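The boundary pattern in its simplest form, assuming Pydantic v2 is installed (the `ToolCall` model and its fields are a made-up example, not from the post):

```python
from pydantic import BaseModel, ValidationError

class ToolCall(BaseModel):
    """Contract for an LLM tool-call payload crossing into our code."""
    name: str
    temperature: float = 0.0

# Wire data arrives as JSON with stringly-typed numbers; default (lax)
# mode coerces "0.7" to the float 0.7 at the boundary.
call = ToolCall.model_validate_json('{"name": "search", "temperature": "0.7"}')
```

A payload missing a required field raises `ValidationError` at the boundary instead of surfacing as an `AttributeError` three layers deep, which is the entire argument for validating at the edge.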
Pattern matching, structural type generics, exception groups, TaskGroup, the f-string self-documenting `=`, and a couple of less-loved features. Code samples for each, with the use cases I reach for them in real work.
The mental model behind asyncio that finally clicked, the production patterns I reach for (TaskGroup, Semaphore, run_in_executor), and the four mistakes that bite every async codebase eventually.