project · 2018-2019
PySpark ETL optimisation for Citibank Singapore
Optimised long-running ETL pipelines for Citibank Singapore (TCS engagement). Cut DataStage job execution from 4 hours to 1 hour through Python multiprocessing and PySpark parallelisation. Designed real-time message handling with Kafka, plus AVRO/Parquet conversion and HDFS compression strategies for the bank's machine-learning data pipelines.
A year-long client engagement with Citibank Singapore through TCS, working on the data plumbing for several of the bank’s machine-learning use cases. Most of the work was making slow things fast and brittle things robust.
What I worked on
- ETL optimisation. Citi’s DataStage jobs were running for hours per cycle, blocking downstream ML training. Re-implemented the heaviest stages in Python + PySpark with explicit multiprocessing on the bottleneck steps, taking a representative job from 4 hours of wall-clock time down to ~1 hour. Same correctness, materially shorter critical path.
- Storage and format choices. Converted on-disk data layouts to AVRO and Parquet where appropriate, and applied compression policies tuned to access patterns. The shared HDFS cluster was the constrained resource, so storage discipline mattered as much as compute discipline.
- Real-time data handling. Built Python consumers and producers against the bank’s Kafka topics to surface real-time signals into the ML training and serving paths.
- Query performance. Profiled and tuned long-running queries against the data warehouse; introduced partition strategies and pushed predicates down into the storage layer so queries read only the data they need.
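The storage-format work above boils down to a handful of cluster settings. A hedged example of what the relevant `spark-defaults.conf` entries might look like on a Spark 2.x-era cluster (the exact codecs chosen per table depend on access patterns, and these keys are illustrative of the tuning surface, not a record of the actual config):

```
# Columnar analytics tables: splittable, fast-to-decompress codec
spark.sql.parquet.compression.codec        snappy

# Row-oriented AVRO for record-at-a-time pipelines (needs spark-avro)
spark.sql.avro.compression.codec           deflate

# Compress intermediate shuffle output to ease HDFS/network pressure
spark.shuffle.compress                     true
```

The general trade-off: Parquet + snappy for scan-heavy ML feature tables, AVRO for schema-evolving record streams, and heavier codecs only where data is written once and read rarely.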
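The multiprocessing pattern from the ETL bullet above can be sketched as follows. This is a minimal illustration, not the actual pipeline: the partition shape, field names, and the `transform_partition` logic are all hypothetical stand-ins for a CPU-bound DataStage-style transform.

```python
from multiprocessing import Pool

def transform_partition(rows):
    # Hypothetical CPU-bound stage: derive a converted amount per row.
    return [
        {"id": r["id"], "amount_sgd": round(r["amount"] * r["fx_rate"], 2)}
        for r in rows
    ]

def run_parallel(partitions, workers=4):
    # Fan independent partitions out across worker processes, then
    # flatten the per-partition results back into one dataset.
    with Pool(processes=workers) as pool:
        results = pool.map(transform_partition, partitions)
    return [row for part in results for row in part]
```

On Windows and macOS (spawn start method) the call to `run_parallel` needs to sit under an `if __name__ == "__main__":` guard; the real win comes when each partition is heavy enough to amortise the process-startup cost.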
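For the Kafka bullet above, the consumer-side routing logic can be sketched with the broker wiring factored out. Everything here is a hypothetical reconstruction: the topic name, event shape, and handler map are invented, and messages are assumed to arrive as JSON-encoded bytes (the bank's actual topics used AVRO payloads, which would swap `json.loads` for an AVRO decoder).

```python
import json

def route_message(raw_value, handlers):
    """Decode one Kafka message payload and dispatch by event type.

    raw_value: the message value as bytes, as kafka-python delivers it.
    handlers:  maps an event-type string to a callable.
    """
    event = json.loads(raw_value.decode("utf-8"))
    handler = handlers.get(event.get("type"))
    if handler is None:
        return None  # unknown event types are skipped, not fatal
    return handler(event)

# Wiring against a real broker would look roughly like:
#   from kafka import KafkaConsumer  # kafka-python
#   consumer = KafkaConsumer("txn-events", bootstrap_servers="broker:9092")
#   for msg in consumer:
#       route_message(msg.value, handlers)
```

Keeping the decode-and-dispatch logic free of any broker dependency makes it unit-testable and reusable between the training and serving paths.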
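The partition-pruning idea behind the query-performance bullet can be shown in miniature. This is a toy model of what the storage layer does with Hive-style partition directories (path names are hypothetical); the point is that a predicate on the partition column is resolved against directory names, so non-matching partitions are never read at all.

```python
def prune_partitions(paths, column, wanted):
    """Keep only partition directories whose key=value segment matches.

    paths:  Hive-style partition directories, e.g. 'tbl/dt=2019-01-02'
    column: the partition column the predicate filters on
    wanted: the set of values the predicate accepts
    """
    keep = []
    for p in paths:
        for seg in p.split("/"):
            if "=" in seg:
                key, value = seg.split("=", 1)
                if key == column and value in wanted:
                    keep.append(p)
                    break
    return keep
```

In PySpark this happens automatically when a DataFrame filter references the partition column of a partitioned Parquet table; the win disappears if the predicate is wrapped in a function the optimiser cannot push down.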
What I learned
This was the engagement where I learned what production data engineering actually demands: the cost of slow is not just slow, it is missed batch windows, stale features, ML models trained on yesterday’s data instead of today’s. Optimising a pipeline from 4 hours to 1 hour was not about ego; it was about whether the bank’s ML team could iterate within a working day. Once you internalise that framing, every later refactor (the API analytics ETL template, the vehicle-telemetry silver layer) starts from the same question: what does latency cost the consumer of this data?
Stack
Python · PySpark · Python multiprocessing · Apache Kafka · AVRO / Parquet · HDFS · IBM InfoSphere DataStage (legacy).