project · 2021

Enterprise data lake on GCP, from scratch

Architected an enterprise data lake on Google Cloud (BigQuery, Cloud Storage, Cloud Composer / Airflow) at Oyo Vacation Homes. Designed a metadata-driven ETL framework in PySpark that cut new-pipeline development time by ~70%. Established schema-versioning and quality-validation patterns the team still uses.

An enterprise data lake at Oyo Vacation Homes, built on Google Cloud from a near-greenfield starting point. I led the team of data engineers that put it together over the second half of 2021.

What I built

A metadata-driven ETL framework in PySpark, running on BigQuery, Cloud Storage, and Cloud Composer. Each pipeline is declared in a YAML config (source, target table, schema, validation rules); the framework handles the rest, so adding a pipeline means writing metadata, not code. Schema versioning and quality validation run at ingest, catching drift and bad data before they reach downstream consumers.

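The metadata-driven pattern can be sketched roughly like this. In the framework each pipeline was a YAML file; here the parsed equivalent is shown inline as a dict so the example is self-contained, and every name (pipeline, path, table, column) is a hypothetical illustration, not the actual config shape.

```python
from dataclasses import dataclass

# Parsed equivalent of one pipeline's YAML declaration.
# All names here are hypothetical.
RAW_CONFIG = {
    "name": "bookings_daily",
    "source": "gs://raw-zone/bookings/*.json",
    "target": "lake.bookings",
    "schema_version": 2,
    "columns": {"booking_id": "STRING", "amount": "FLOAT64", "created_at": "TIMESTAMP"},
}

@dataclass
class PipelineSpec:
    """Typed view of one pipeline's metadata. A single generic runner
    consumes specs like this, so a new pipeline is new metadata, not new code."""
    name: str
    source: str
    target: str
    schema_version: int
    columns: dict

    @classmethod
    def from_config(cls, cfg: dict) -> "PipelineSpec":
        # Field names mirror the config keys, so the spec is built directly.
        return cls(**cfg)

spec = PipelineSpec.from_config(RAW_CONFIG)
```

The design choice this illustrates: the convention lives in one place (the spec and the runner), so reviewing a new pipeline means reviewing a small config rather than a new codebase.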
What I led on

The team was small but spread across time zones. I focused on three things: (1) a framework boring enough that anyone could ship a pipeline by following the convention; (2) coding standards and review rituals that kept quality high without slowing the team down; (3) regular knowledge-sharing sessions that kept the framework’s design choices transparent, so contributors could extend it confidently.

Why this earns a spot in projects

A short tenure with disproportionate impact. By the time I left, the team was shipping pipelines roughly 3x faster than at the start, with materially fewer downstream incidents because the framework caught schema drift and bad data at ingest. The metadata-driven pattern is one I have re-applied at every job since: the right level of abstraction is the one that makes the right thing the easy thing.
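The ingest-time checks mentioned above can be sketched as a schema-drift comparison. This is a minimal illustration, not the framework's actual validation code; the column names are hypothetical.

```python
def detect_schema_drift(expected_columns: dict, incoming_columns: list) -> dict:
    """Compare the columns arriving in a batch against the declared schema.

    A non-empty result fails the pipeline at ingest, instead of letting
    bad data reach downstream consumers.
    """
    expected = set(expected_columns)
    incoming = set(incoming_columns)
    return {
        "missing": sorted(expected - incoming),
        "unexpected": sorted(incoming - expected),
    }

# Hypothetical example: the declared schema expects three columns, but the
# incoming batch dropped one and added another.
declared = {"booking_id": "STRING", "amount": "FLOAT64", "created_at": "TIMESTAMP"}
drift = detect_schema_drift(declared, ["booking_id", "amount", "channel"])
# → {"missing": ["created_at"], "unexpected": ["channel"]}
```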

Stack

Python · PySpark · BigQuery · Cloud Storage · Cloud Composer (Apache Airflow) · GCP IAM and service accounts · YAML-based configuration.
