project
PySpark ETL optimisation for Citi Bank Singapore
Optimised long-running ETL pipelines for Citi Bank Singapore (TCS engagement). Cut DataStage job execution time from 4 hours to 1 hour using Python multiprocessing and PySpark parallelisation. Designed real-time message handling with Kafka, along with Avro/Parquet conversion and HDFS compression strategies for the bank's machine-learning data pipelines.