project · 2021

Enterprise data lake on GCP, from scratch

Architected an enterprise data lake on Google Cloud (BigQuery, Cloud Storage, Cloud Composer / Airflow) at Oyo Vacation Homes. Designed a metadata-driven ETL framework in PySpark that cut new-pipeline development time by ~70%. Established schema-versioning and quality-validation patterns the team still uses.

#data-engineering #bigquery #pyspark #airflow #gcp #metadata-driven #data-governance #leadership

An enterprise data lake at Oyo Vacation Homes, built on Google Cloud from a near-greenfield starting point. I led the team of data engineers that put it together over the second half of 2021.

What I built

Storage and warehouse layout: bronze (raw) on Cloud Storage, silver (cleaned, partitioned) and gold (aggregated, business-ready) on BigQuery. This is the same medallion model I have used since.
Orchestration: Cloud Composer (managed Airflow). DAGs were generated from metadata rather than hand-written (see the framework item below).
Metadata-driven ETL framework: a PySpark base library where adding a new pipeline meant filling in a YAML config (source connection, destination table, transformation steps, schedule). From that config the framework generated the DAG, the Spark job, the schema-validation step, and the quality checks. This cut new-pipeline development time by roughly 70% compared with writing each pipeline by hand.
Data governance: schema versioning baked into the framework so backwards-compatible upstream changes did not break downstream jobs. Quality-validation rules ran at the bronze→silver boundary; failures parked the data and alerted the owner instead of silently propagating.
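
The items above can be illustrated with a minimal sketch in plain Python. It is not the original framework code: the config field names, the task ordering, and the helpers build_plan, is_backward_compatible, and gate_batch are all hypothetical, and the config is written as the dict a YAML file would parse into so the example runs without Airflow, PySpark, or a YAML parser.

```python
from dataclasses import dataclass

# Hypothetical pipeline config, shown as the dict a YAML file would parse into.
# Field names are illustrative assumptions, not the original framework's schema.
PIPELINE = {
    "name": "bookings_daily",
    "source": {"type": "jdbc", "connection": "bookings_db"},
    "destination": "silver.bookings",
    "schedule": "0 2 * * *",
    "transformations": ["drop_duplicates", "cast_types"],
    "quality_rules": [{"column": "booking_id", "check": "not_null"}],
}

REQUIRED_KEYS = ("name", "source", "destination", "schedule", "transformations")


@dataclass(frozen=True)
class Task:
    task_id: str
    upstream: tuple  # task_ids that must finish before this task starts


def build_plan(config: dict) -> list:
    """Expand one config into an ordered, linear list of tasks."""
    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise ValueError(f"config missing keys: {missing}")
    # Assumed ordering: validate the schema first, run quality checks last
    # so bad data is stopped before the load into silver.
    steps = ["extract", "validate_schema"]
    steps += [f"transform_{name}" for name in config["transformations"]]
    steps += ["quality_checks", "load"]
    plan, previous = [], ()
    for step in steps:
        task = Task(task_id=f"{config['name']}.{step}", upstream=previous)
        plan.append(task)
        previous = (task.task_id,)
    return plan


def is_backward_compatible(old_fields: dict, new_fields: dict) -> bool:
    """True when every existing column survives with the same type.

    New columns are allowed; dropped or retyped columns are not.
    """
    return all(new_fields.get(col) == typ for col, typ in old_fields.items())


def gate_batch(rows: list, rules: list) -> tuple:
    """Return ("silver", []) if every rule passes, else ("parked", failures)."""
    failures = []
    for rule in rules:
        if rule["check"] == "not_null":
            bad = sum(1 for row in rows if row.get(rule["column"]) is None)
            if bad:
                failures.append(f"{rule['column']}: {bad} null value(s)")
    return ("parked", failures) if failures else ("silver", [])


if __name__ == "__main__":
    for task in build_plan(PIPELINE):
        print(task.task_id, "<-", task.upstream)
    print(gate_batch([{"booking_id": None}], PIPELINE["quality_rules"]))
```

In a real deployment each Task would presumably map to an Airflow operator that submits a Spark job, and a "parked" result would trigger the owner alert; both are left out here to keep the sketch dependency-free.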

What I led on

The team was small but cross-time-zone. I focused on three things: (1) the framework being boring enough that anyone could ship a pipeline by following the convention; (2) coding standards and review rituals that kept quality high without slowing the team; (3) regular knowledge-sharing sessions so the framework’s design choices were transparent and contributors could extend it confidently.

Why this earns a spot in projects

A short tenure with disproportionate impact. By the time I left, the team was shipping pipelines roughly 3x faster than at the start, with materially fewer downstream incidents because the framework caught schema drift and bad data at ingest. The metadata-driven pattern is one I have re-applied at every job since: the right level of abstraction is the one that makes the right thing the easy thing.

Stack

Python · PySpark · BigQuery · Cloud Storage · Cloud Composer (Apache Airflow) · GCP IAM and service accounts · YAML-based configuration.
