project · 2021

Enterprise data lake on GCP, from scratch

Architected an enterprise data lake on Google Cloud (BigQuery, Cloud Storage, Cloud Composer / Airflow) at Oyo Vacation Homes. Designed a metadata-driven ETL framework in PySpark that cut new-pipeline development time by ~70%. Established schema-versioning and quality-validation patterns the team still uses.

This was an enterprise data lake at Oyo Vacation Homes, built on Google Cloud from a near-greenfield starting point. I led the team of data engineers that built it over the second half of 2021.

What I built

What I led on

The team was small but spread across time zones. I focused on three things: (1) keeping the framework boring enough that anyone could ship a pipeline by following the convention; (2) coding standards and review rituals that kept quality high without slowing the team; (3) regular knowledge-sharing sessions so the framework’s design choices were transparent and contributors could extend it confidently.

Why this earns a spot in projects

A short tenure with disproportionate impact. By the time I left, the team was shipping pipelines roughly 3x faster than at the start, with materially fewer downstream incidents because the framework caught schema drift and bad data at ingest. The metadata-driven pattern is one I have re-applied at every job since: the right level of abstraction is the one that makes the right thing the easy thing.
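To make the ingest-time checks concrete, here is a minimal sketch of the idea in plain Python. It is an illustration under stated assumptions, not the original framework code: the pipeline name, column names, check names, and both helper functions are hypothetical, and a dictionary stands in for the YAML file a real pipeline definition would live in. The real framework ran on PySpark; plain dictionaries are used here so the example runs without a cluster.

```python
# Hypothetical pipeline metadata. In a YAML-configured framework this would be
# loaded from a file; every name below is invented for illustration.
PIPELINE_CONFIG = {
    "name": "bookings_daily",
    "schema_version": 2,
    "columns": {
        "booking_id": "string",
        "amount": "double",
        "checkin_date": "date",
    },
    "checks": {
        "not_null": ["booking_id"],
        "non_negative": ["amount"],
    },
}


def detect_schema_drift(expected, observed):
    """Compare the declared column types with the types seen at ingest."""
    shared = set(expected) & set(observed)
    return {
        "missing": sorted(set(expected) - set(observed)),
        "unexpected": sorted(set(observed) - set(expected)),
        "type_changed": sorted(c for c in shared if expected[c] != observed[c]),
    }


def validate_rows(rows, checks):
    """Split rows into (good, bad) according to the declared quality checks."""
    good, bad = [], []
    for row in rows:
        not_null_ok = all(
            row.get(col) is not None for col in checks.get("not_null", [])
        )
        non_negative_ok = all(
            row.get(col) is None or row[col] >= 0
            for col in checks.get("non_negative", [])
        )
        (good if not_null_ok and non_negative_ok else bad).append(row)
    return good, bad


if __name__ == "__main__":
    observed = {"booking_id": "string", "amount": "string", "channel": "string"}
    print(detect_schema_drift(PIPELINE_CONFIG["columns"], observed))

    rows = [
        {"booking_id": "b1", "amount": 10.0},
        {"booking_id": None, "amount": 5.0},
        {"booking_id": "b3", "amount": -1.0},
    ]
    good, bad = validate_rows(rows, PIPELINE_CONFIG["checks"])
    print(len(good), "rows accepted,", len(bad), "rows quarantined")
```

The point of the pattern is that adding a pipeline means adding a config entry, not new validation code: the same two functions apply to every dataset, so the checks cannot be skipped by accident.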

Stack

Python · PySpark · BigQuery · Cloud Storage · Cloud Composer (Apache Airflow) · GCP IAM and service accounts · YAML-based configuration.
