project · 2023-2024

PPDA Data Validator, schema-aware validation for analytics pipelines

A Pydantic-based data-validation library for the team that runs API analytics. Engineers point it at a sample JSON event; it generates a schema, they edit it to add field-level rules (null, range, allowed-values, custom), and the orchestrator runs the validations on every ingest. Distributed as an installable wheel.

The goal: any engineer on the API analytics team can add validation to a new event type in under an hour, without hand-rolling a Pydantic schema.

What it does

Three steps:

  1. Generate a schema from a sample JSON event. The library reads the event and emits a Pydantic-flavoured schema file with placeholder rules (first sketch below).
  2. Engineer edits the schema to add field-level rules (second sketch below). The library supports three out of the box:
    • Null check: whether a field is allowed to be null.
    • Range check: numeric [lo, hi] bounds.
    • Allowed-values check: whitelist for enumerated fields.
    When a built-in is not enough, custom rules drop in as plain Python functions.
  3. Run the orchestrator with the input dataset. It applies the schema plus any custom validators and returns a structured pass/fail report (third sketch below).
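What step 1's output could look like, as a toy sketch. This assumes a flat JSON event with primitive fields; the names generate_schema and GeneratedEvent are invented for illustration, not the library's real API, and the real generator also handles nesting.

```python
import json

# Maps sample-value Python types to annotation strings. The type() lookup
# is exact, so True maps to bool rather than int.
PY_TYPES = {str: "str", int: "int", float: "float", bool: "bool"}


def generate_schema(sample_path: str, model_name: str = "GeneratedEvent") -> str:
    """Read one sample event and emit a Pydantic model with placeholder rules."""
    with open(sample_path) as f:
        event = json.load(f)
    lines = [
        "from pydantic import BaseModel",
        "",
        "",
        f"class {model_name}(BaseModel):",
    ]
    for key, value in event.items():
        # Unknown or null sample values fall back to a loose annotation.
        py_type = PY_TYPES.get(type(value), "object")
        lines.append(f"    {key}: {py_type}  # TODO: null / range / allowed-values")
    return "\n".join(lines)
```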
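For step 2, a sketch of what an edited schema file might look like once the rules are in. The field names and thresholds are invented for illustration; the rule syntax here is plain Pydantic v2, standing in for the library's Pydantic-flavoured format.

```python
from typing import Literal, Optional

from pydantic import BaseModel, Field, field_validator


class ApiCallEvent(BaseModel):
    # Null check: user_id must never be null.
    user_id: str
    # Null check relaxed: session_id may be null.
    session_id: Optional[str] = None
    # Range check: latency must land in [0, 60000] ms.
    latency_ms: float = Field(ge=0, le=60_000)
    # Allowed-values check: region comes from an enumerated whitelist.
    region: Literal["us-east", "us-west", "eu-central"]

    # Custom rule as a plain Python function, for when built-ins fall short.
    @field_validator("user_id")
    @classmethod
    def user_id_has_prefix(cls, v: str) -> str:
        if not v.startswith("u_"):
            raise ValueError("user_id must start with 'u_'")
        return v
```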
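And a minimal sketch of the step 3 orchestrator pass, assuming events arrive as already-parsed dicts. validate_batch and ValidationReport are hypothetical names; the validation call and error unpacking are standard Pydantic.

```python
from dataclasses import dataclass, field

from pydantic import BaseModel, ValidationError


@dataclass
class ValidationReport:
    """Structured pass/fail report handed back to the ingest job."""
    passed: int = 0
    failures: list[str] = field(default_factory=list)


def validate_batch(schema: type[BaseModel], events: list[dict]) -> ValidationReport:
    report = ValidationReport()
    for i, event in enumerate(events):
        try:
            # model_validate applies the schema rules and custom validators.
            schema.model_validate(event)
            report.passed += 1
        except ValidationError as exc:
            # Record which event failed, on which field, and why.
            for err in exc.errors():
                loc = ".".join(str(p) for p in err["loc"])
                report.failures.append(f"event[{i}].{loc}: {err['msg']}")
    return report
```

In this sketch, `report = validate_batch(ApiCallEvent, events)` and a non-empty `report.failures` is what lets the ingest step fail loud.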

Why this matters for analytics teams

Schema drift in upstream telemetry is a silent killer for downstream dashboards. The team was discovering bad data via “the dashboard looks weird” rather than at ingest. The validator pushes detection upstream: ingest the event, validate, fail loud. The schema files become living documentation for what each event is supposed to look like.

Stack

Python and Pydantic at the core, packaged and distributed as an installable wheel; validations run inside the team's ingest orchestrator.

Why this earns a spot in projects

It is a small library, but it is the kind of thing that quietly lifts the floor of a whole team. Six months after I shipped it, every new event type on the team got a schema written before the data even started flowing, because that was the path of least resistance. That is the bar: make the right thing the easy thing.

← all projects