project · 2023-2024

PPDA Data Validator, schema-aware validation for analytics pipelines

A Pydantic-based data-validation library for the team that runs API analytics. Engineers point it at a sample JSON event; it generates a schema, they edit it to add field-level rules (null, range, allowed-values, custom), and the orchestrator runs the validations on every ingest. Distributed as an installable wheel.

The goal: any engineer on the API analytics team can add validation to a new event type in under an hour, without hand-rolling a Pydantic schema.

What it does

Three steps:

  1. Generate a schema from a sample JSON event. The library reads the event and emits a Pydantic-flavoured schema file with placeholder rules (first sketch below).
  2. Engineer edits the schema to add field-level rules (second sketch below). The library supports three out of the box:
    • Null check: whether a field is allowed to be null.
    • Range check: numeric [lo, hi] bounds.
    • Allowed-values check: whitelist for enumerated fields.
    When a built-in is not enough, custom rules drop in as plain Python functions.
  3. Run the orchestrator with the input dataset. It applies the schema plus any custom validators and returns a structured pass/fail report (third sketch below).
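What step 1's output could look like, as a toy sketch. This assumes a flat JSON event with primitive fields; the names generate_schema and GeneratedEvent are invented for illustration, not the library's real API, and the real generator also handles nesting.

```python
import json

# Maps sample-value Python types to annotation strings. The type() lookup
# is exact, so True maps to bool rather than int.
PY_TYPES = {str: "str", int: "int", float: "float", bool: "bool"}


def generate_schema(sample_path: str, model_name: str = "GeneratedEvent") -> str:
    """Read one sample event and emit a Pydantic model with placeholder rules."""
    with open(sample_path) as f:
        event = json.load(f)
    lines = [
        "from pydantic import BaseModel",
        "",
        "",
        f"class {model_name}(BaseModel):",
    ]
    for key, value in event.items():
        # Unknown or null sample values fall back to a loose annotation.
        py_type = PY_TYPES.get(type(value), "object")
        lines.append(f"    {key}: {py_type}  # TODO: null / range / allowed-values")
    return "\n".join(lines)
```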
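For step 2, a sketch of what an edited schema file might look like once the rules are in. The field names and thresholds are invented for illustration; the rule syntax here is plain Pydantic v2, standing in for the library's Pydantic-flavoured format.

```python
from typing import Literal, Optional

from pydantic import BaseModel, Field, field_validator


class ApiCallEvent(BaseModel):
    # Null check: user_id must never be null.
    user_id: str
    # Null check relaxed: session_id may be null.
    session_id: Optional[str] = None
    # Range check: latency must land in [0, 60000] ms.
    latency_ms: float = Field(ge=0, le=60_000)
    # Allowed-values check: region comes from an enumerated whitelist.
    region: Literal["us-east", "us-west", "eu-central"]

    # Custom rule as a plain Python function, for when built-ins fall short.
    @field_validator("user_id")
    @classmethod
    def user_id_has_prefix(cls, v: str) -> str:
        if not v.startswith("u_"):
            raise ValueError("user_id must start with 'u_'")
        return v
```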
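And a minimal sketch of the step 3 orchestrator pass, assuming events arrive as already-parsed dicts. validate_batch and ValidationReport are hypothetical names; the validation call and error unpacking are standard Pydantic.

```python
from dataclasses import dataclass, field

from pydantic import BaseModel, ValidationError


@dataclass
class ValidationReport:
    """Structured pass/fail report handed back to the ingest job."""
    passed: int = 0
    failures: list[str] = field(default_factory=list)


def validate_batch(schema: type[BaseModel], events: list[dict]) -> ValidationReport:
    report = ValidationReport()
    for i, event in enumerate(events):
        try:
            # model_validate applies the schema rules and custom validators.
            schema.model_validate(event)
            report.passed += 1
        except ValidationError as exc:
            # Record which event failed, on which field, and why.
            for err in exc.errors():
                loc = ".".join(str(p) for p in err["loc"])
                report.failures.append(f"event[{i}].{loc}: {err['msg']}")
    return report
```

In this sketch, `report = validate_batch(ApiCallEvent, events)` and a non-empty `report.failures` is what lets the ingest step fail loud.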

Why this matters for analytics teams

Schema drift in upstream telemetry is a silent killer for downstream dashboards. The team was discovering bad data via “the dashboard looks weird” rather than at ingest. The validator pushes detection upstream: ingest the event, validate, fail loud. The schema files become living documentation for what each event is supposed to look like.

Stack

Python and Pydantic at the core, packaged and distributed as an installable wheel; validations run inside the team's ingest orchestrator.

Why this earns a spot in projects

It is a small library, but it is the kind of thing that quietly lifts the floor of a whole team. Six months after I shipped it, every new event type on the team got a schema written before the data even started flowing, because that was the path of least resistance. That is the bar: make the right thing the easy thing.

← all projects