project Β· 2021
Enterprise data lake on GCP, from scratch
Architected an enterprise data lake on Google Cloud (BigQuery, Cloud Storage, Cloud Composer / Airflow) at Oyo Vacation Homes. Designed a metadata-driven ETL framework in PySpark that cut new-pipeline development time by ~70%. Established schema-versioning and quality-validation patterns the team still uses.
An enterprise data lake at Oyo Vacation Homes, built on Google Cloud from a near-greenfield starting point. I led the team of data engineers that put it together over the second half of 2021.
What I built
- Storage and warehouse layout: bronze (raw) on Cloud Storage, silver (cleaned, partitioned) and gold (aggregated, business-ready) on BigQuery. Same medallion model I have applied since.
- Orchestration: Cloud Composer (managed Airflow). DAGs were generated, not hand-written, from metadata (see below).
- Metadata-driven ETL framework: a PySpark base library where adding a new pipeline meant filling in a YAML config (source connection, destination table, transformation steps, schedule). The framework generated the DAG, the Spark job, the schema-validation step, and the quality checks. Cut new-pipeline development time by roughly 70% versus writing each one by hand.
- Data governance: schema versioning baked into the framework so backwards-compatible upstream changes did not break downstream jobs. Quality-validation rules ran at the bronzeβsilver boundary; failures parked the data and alerted the owner instead of silently propagating.
What I led on
The team was small but cross-time-zone. I focused on three things: (1) the framework being boring enough that anyone could ship a pipeline by following the convention; (2) coding standards and review rituals that kept quality high without slowing the team; (3) regular knowledge-sharing sessions so the frameworkβs design choices were transparent and contributors could extend it confidently.
Why this earns a spot in projects
A short tenure with disproportionate impact. By the time I left, the team was shipping pipelines roughly 3x faster than at the start, with materially fewer downstream incidents because the framework caught schema drift and bad data at ingest. The metadata-driven pattern is one I have re-applied at every job since: the right level of abstraction is the one that makes the right thing the easy thing.
Stack
Python Β· PySpark Β· BigQuery Β· Cloud Storage Β· Cloud Composer (Apache Airflow) Β· GCP IAM and service accounts Β· YAML-based configuration.