tag

#pyspark

6 items · 6 projects

projects

project

Vehicle-telemetry silver-layer ETL

Refactored a vehicle-telemetry processing pipeline that transforms raw nested JSON and protobuf payloads from millions of in-car navigation clients into clean Delta tables powering navigation-quality dashboards. Designed for query simplicity: PMs can answer "route success rate in country X this week" with a single SELECT.

project

Enterprise data lake on GCP, from scratch

Architected an enterprise data lake on Google Cloud (BigQuery, Cloud Storage, Cloud Composer / Airflow) at Oyo Vacation Homes. Designed a metadata-driven ETL framework in PySpark that cut new-pipeline development time by ~70%. Established schema-versioning and quality-validation patterns the team still uses.

project

PySpark ETL optimisation for Citi Bank Singapore

Optimised long-running ETL pipelines for Citi Bank Singapore (TCS engagement). Cut DataStage job execution time from 4 hours to 1 hour through Python multiprocessing and PySpark parallelisation. Designed real-time message handling with Kafka, plus Avro / Parquet conversion and HDFS compression strategies for the bank's machine-learning data pipelines.
