project · 2021
Enterprise data lake on GCP, from scratch
Architected an enterprise data lake on Google Cloud (BigQuery, Cloud Storage, Cloud Composer / Airflow) at Oyo Vacation Homes. Designed a metadata-driven ETL framework in PySpark that cut new-pipeline development time by ~70%. Established schema-versioning and quality-validation patterns the team still uses.
An enterprise data lake at Oyo Vacation Homes, built on Google Cloud from a near-greenfield starting point. I led the team of data engineers that put it together over the second half of 2021.
What I built
- Storage and warehouse layout: bronze (raw) on Cloud Storage, silver (cleaned, partitioned) and gold (aggregated, business-ready) on BigQuery. The same medallion model I have applied since; an illustrative layout is sketched after this list.
- Orchestration: Cloud Composer (managed Airflow). DAGs were generated, not hand-written, from metadata (see below).
- Metadata-driven ETL framework: a PySpark base library where adding a new pipeline meant filling in a YAML config (source connection, destination table, transformation steps, schedule). The framework generated the DAG, the Spark job, the schema-validation step, and the quality checks, cutting new-pipeline development time by roughly 70% versus writing each one by hand. The config-to-DAG flow is sketched after this list.
- Data governance: schema versioning baked into the framework so that additive, backwards-compatible upstream changes did not break downstream jobs. Quality-validation rules ran at the bronze→silver boundary; failures parked the data and alerted the owner instead of silently propagating (see the quality-gate sketch below).
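To make the zone layout concrete, here is a minimal sketch of the convention. The bucket, project, and dataset names are illustrative assumptions, not the production identifiers.

```python
# Illustrative medallion layout; every name below is an assumption
# for the sketch, not an actual bucket or dataset.
MEDALLION_ZONES = {
    # Raw landing zone on Cloud Storage, immutable, partitioned by ingest date.
    "bronze": "gs://lake-raw/{source}/{table}/dt={ingest_date}/",
    # Cleaned, schema-validated BigQuery tables, partitioned for query cost.
    "silver": "my-project.silver.{table}",
    # Aggregated, business-ready BigQuery marts.
    "gold": "my-project.gold.{mart}",
}
```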
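To show the shape of the metadata-driven pattern (a hedged sketch, not the framework's actual code), here is one YAML config being parsed and expanded into a standard extract → validate → load DAG. Every field name, task ID, and connection ID is an assumption for illustration.

```python
# Minimal sketch of config-driven DAG generation. The YAML schema,
# task names, and connection IDs are illustrative assumptions.
from datetime import datetime

import yaml
from airflow import DAG
from airflow.operators.python import PythonOperator

PIPELINE_CONFIG = """
pipeline: bookings_daily        # hypothetical pipeline name
source:
  connection: mysql_bookings    # hypothetical connection ID
  table: bookings
destination:
  dataset: silver
  table: bookings_cleaned
schedule: "0 3 * * *"
checks:
  - not_null: booking_id
"""


def build_dag(cfg: dict) -> DAG:
    """Expand one pipeline config into the standard
    extract -> validate -> load task chain."""
    dag = DAG(
        dag_id=f"etl_{cfg['pipeline']}",
        schedule_interval=cfg["schedule"],
        start_date=datetime(2021, 7, 1),
        catchup=False,
    )
    with dag:
        extract = PythonOperator(
            task_id="extract_to_bronze",
            python_callable=lambda: print(f"extract {cfg['source']['table']}"),
        )
        validate = PythonOperator(
            task_id="validate_schema_and_quality",
            python_callable=lambda: print(f"apply checks: {cfg['checks']}"),
        )
        load = PythonOperator(
            task_id="load_to_silver",
            python_callable=lambda: print(f"load {cfg['destination']['table']}"),
        )
        extract >> validate >> load
    return dag


cfg = yaml.safe_load(PIPELINE_CONFIG)
# Registering the DAG in globals() is how Airflow's file parser
# discovers dynamically generated DAGs.
globals()[f"etl_{cfg['pipeline']}"] = build_dag(cfg)
```

The point of the pattern is that the Python above is written once; each new pipeline is only the YAML.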
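And the parking behaviour at the bronze→silver boundary, sketched with a single not-null rule. The quarantine bucket and the alert_owner() hook are hypothetical stand-ins for illustration.

```python
# Hedged sketch of the bronze -> silver quality gate. Rule, quarantine
# path, and alert hook are assumptions, not the framework's actual API.
from pyspark.sql import DataFrame
from pyspark.sql import functions as F

QUARANTINE_BUCKET = "gs://lake-quarantine"  # hypothetical bucket


def alert_owner(owner: str, table: str, n_failed: int) -> None:
    # Stand-in for the real notification hook (e-mail, Slack, etc.).
    print(f"[quality] {n_failed} rows quarantined for {table}; owner: {owner}")


def quality_gate(df: DataFrame, table: str, key_col: str, owner: str) -> DataFrame:
    """Split rows on a simple not-null key rule; park failures and
    alert the owner instead of letting bad data flow downstream."""
    failed = df.filter(F.col(key_col).isNull())
    passed = df.filter(F.col(key_col).isNotNull())

    n_failed = failed.count()
    if n_failed:
        failed.write.mode("append").parquet(f"{QUARANTINE_BUCKET}/{table}/")
        alert_owner(owner, table, n_failed)
    return passed  # only validated rows continue to silver
```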
What I led on
The team was small but spread across time zones. I focused on three things: (1) keeping the framework boring enough that anyone could ship a pipeline by following the convention; (2) coding standards and review rituals that kept quality high without slowing the team down; (3) regular knowledge-sharing sessions so the framework’s design choices stayed transparent and contributors could extend it confidently.
Why this earns a spot in projects
A short tenure with disproportionate impact. By the time I left, the team was shipping pipelines roughly 3x faster than at the start, with materially fewer downstream incidents because the framework caught schema drift and bad data at ingest. The metadata-driven pattern is one I have re-applied at every job since: the right level of abstraction is the one that makes the right thing the easy thing.
Stack
Python · PySpark · BigQuery · Cloud Storage · Cloud Composer (Apache Airflow) · GCP IAM and service accounts · YAML-based configuration.