project · 2018-2019

PySpark ETL optimisation for Citibank Singapore

Optimised long-running ETL pipelines for Citibank Singapore (TCS engagement). Cut DataStage job execution time from 4 hours to 1 hour through Python multiprocessing and PySpark parallelisation. Designed real-time message handling with Kafka, plus Avro / Parquet conversion and HDFS compression strategies for the bank's machine-learning data pipelines.
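The multiprocessing side of that speed-up follows a standard fan-out pattern: partition a batch across worker processes and collect the results. A minimal sketch, assuming a CPU-bound per-record transformation (`transform_record` and the worker count are illustrative, not from the original pipeline):

```python
from multiprocessing import Pool

def transform_record(n: int) -> int:
    # Stand-in for a CPU-bound ETL transformation on one record.
    return n * n

def run_batch(records, workers: int = 4):
    # Fan the batch out across worker processes;
    # chunksize batches records per task to keep IPC overhead low.
    with Pool(processes=workers) as pool:
        return pool.map(transform_record, records, chunksize=64)

if __name__ == "__main__":
    print(run_batch(range(8)))
```

The same shape applies whether the unit of work is a record, a file, or a DataStage job invocation; the win comes from keeping all cores busy instead of processing sequentially.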

A year-long client engagement with Citibank Singapore through TCS, working on the data plumbing for several of the bank's machine-learning use cases. Most of the work was making slow things fast and brittle things robust.

What I worked on

- Profiled and optimised long-running DataStage ETL jobs, cutting execution time from 4 hours to 1 hour by parallelising work with Python multiprocessing and PySpark.
- Designed real-time message handling on Apache Kafka for the bank's streaming feeds.
- Built Avro / Parquet conversion steps and tuned HDFS compression to shrink storage and speed up downstream ML reads.
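The Avro-to-Parquet conversion step can be sketched as a small PySpark job; the paths, partition count, the snappy codec, and the spark-avro package version below are assumptions for illustration, not the pipeline's actual settings:

```python
from pyspark.sql import SparkSession

# Cluster sketch: read Avro input, write snappy-compressed Parquet for ML reads.
# Assumes the spark-avro package is on the classpath, e.g. submitted with
# --packages org.apache.spark:spark-avro_2.12:<spark-version>
spark = (
    SparkSession.builder
    .appName("avro-to-parquet")
    .config("spark.sql.parquet.compression.codec", "snappy")
    .getOrCreate()
)

df = spark.read.format("avro").load("hdfs:///data/raw/events")  # hypothetical path

# Repartition to control output file sizes on HDFS before the columnar write.
(df.repartition(64)
   .write.mode("overwrite")
   .parquet("hdfs:///data/curated/events"))  # hypothetical path
```

Columnar Parquet with a splittable compression codec is what makes the downstream feature reads cheap: ML jobs scan only the columns they need instead of deserialising whole Avro records.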

What I learned

This was the engagement where I learned what production data engineering actually demands: the cost of slow is not just slow, it is missed batch windows, stale features, ML models trained on yesterday’s data instead of today’s. Optimising a pipeline from 4 hours to 1 hour was not about ego; it was about whether the bank’s ML team could iterate within a working day. Once you internalise that framing, every later refactor (the API analytics ETL template, the vehicle-telemetry silver layer) starts from the same question: what does latency cost the consumer of this data?

Stack

Python · PySpark · Python multiprocessing · Apache Kafka · Avro / Parquet · HDFS · IBM InfoSphere DataStage (legacy).
