Overview
BRI's Anti-Money Laundering (AML) program required reliable data pipelines that could pull data from many source systems, stage and reconcile it, and then deliver clean data to AML detection workloads. The pipelines needed to be repeatable, observable, and aligned with the regulatory data model.
Approach
Data modeling
- Designed staging tables in Hive to hold raw source data prior to processing.
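As an illustration of the staging layer, the following is a minimal sketch of how a staging table definition might be generated. The table name, column names, the `load_date` partition column, the Parquet storage format, and the choice to keep raw values as `STRING` are all assumptions made for the example; the actual schema is not described here.

```python
from typing import Mapping


def staging_table_ddl(table: str, columns: Mapping[str, str],
                      partition_col: str = "load_date") -> str:
    """Build a Hive CREATE TABLE statement for a raw staging table.

    Every name passed in is hypothetical; the partition column and
    Parquet storage are assumptions, not the project's real settings.
    """
    if not columns:
        raise ValueError("a staging table needs at least one column")
    col_lines = ",\n".join(
        f"  `{name}` {dtype}" for name, dtype in columns.items()
    )
    return (
        f"CREATE TABLE IF NOT EXISTS {table} (\n"
        f"{col_lines}\n"
        f")\n"
        f"PARTITIONED BY (`{partition_col}` STRING)\n"
        f"STORED AS PARQUET"
    )


# Hypothetical staging table: raw values stay as STRING so that nothing
# is rejected or coerced before the transformation step.
ddl = staging_table_ddl(
    "stg.core_transactions",
    {
        "txn_id": "STRING",
        "account_id": "STRING",
        "amount": "STRING",
        "txn_ts": "STRING",
    },
)
print(ddl)
```

Keeping staging columns untyped is one common way to separate "did the data arrive" from "is the data valid"; type casting and validation then happen in the pipeline step that loads the target tables.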
Pipelines
- Wrote Python scripts using Spark to integrate and transform data inside Hive.
- Built automated ETL pipelines moving data from staging into AML target tables.
- Tested and debugged each pipeline path end-to-end.
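A staging-to-target load of the kind described above could look roughly like the sketch below. It assumes one partition per `load_date` and overwrites that partition on each run, so that a rerun replaces data instead of duplicating it. The table names, column expressions, and the helper names `build_load_statement` and `load_partition` are hypothetical; the only Spark call used is `SparkSession.sql`.

```python
from datetime import date
from typing import Mapping, Protocol


class SqlRunner(Protocol):
    """Anything exposing .sql(str), for example a pyspark.sql.SparkSession."""

    def sql(self, sqlQuery: str): ...


def build_load_statement(staging: str, target: str,
                         select_exprs: Mapping[str, str],
                         load_date: str) -> str:
    """Render an INSERT OVERWRITE for a single partition of the target.

    select_exprs maps target column name -> SQL expression over staging.
    """
    # Parse and re-render the date so malformed input fails early.
    day = date.fromisoformat(load_date).isoformat()
    projection = ",\n".join(
        f"  {expr} AS {alias}" for alias, expr in select_exprs.items()
    )
    return (
        f"INSERT OVERWRITE TABLE {target} PARTITION (load_date = '{day}')\n"
        f"SELECT\n{projection}\n"
        f"FROM {staging}\n"
        f"WHERE load_date = '{day}'"
    )


def load_partition(spark: SqlRunner, staging: str, target: str,
                   select_exprs: Mapping[str, str], load_date: str) -> str:
    """Run the load for one day and return the statement that was executed."""
    statement = build_load_statement(staging, target, select_exprs, load_date)
    spark.sql(statement)
    return statement


# Hypothetical mapping: cast raw staging strings into typed target columns.
print(build_load_statement(
    "stg.core_transactions",
    "aml.transactions",
    {
        "txn_id": "txn_id",
        "account_id": "account_id",
        "amount": "CAST(amount AS DECIMAL(18,2))",
        "txn_ts": "CAST(txn_ts AS TIMESTAMP)",
    },
    "2024-01-31",
))
```

Separating statement construction from execution keeps the transformation logic testable without a cluster: the rendered SQL can be checked in a unit test, and `load_partition` can be exercised with any object that records the statements it receives.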
Performance & observability
- Tuned Spark configurations and ETL stages for production throughput.
- Implemented logging and monitoring around ETL activity so that issues surfaced quickly.
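Logging around ETL activity can be sketched with a small context manager that records the start, outcome, and duration of each step. This is an illustrative pattern built on Python's standard `logging` module, not the project's actual instrumentation; the `etl_step` helper, the logger name `etl`, and the `key=value` message format are assumptions.

```python
import logging
import time
from contextlib import contextmanager

logger = logging.getLogger("etl")


@contextmanager
def etl_step(name: str, log: logging.Logger = logger):
    """Log the start, result, and elapsed time of one ETL step."""
    start = time.monotonic()
    log.info("step=%s status=started", name)
    try:
        yield
    except Exception:
        # log.exception records at ERROR level and attaches the traceback.
        log.exception("step=%s status=failed elapsed_s=%.2f",
                      name, time.monotonic() - start)
        raise
    else:
        log.info("step=%s status=succeeded elapsed_s=%.2f",
                 name, time.monotonic() - start)


logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s %(message)s",
)

# Hypothetical step name; the body would hold the Spark work for that step.
with etl_step("load_core_transactions"):
    pass
```

Consistent `key=value` messages make it straightforward for a log-based monitor to alert on `status=failed` or on unusually large `elapsed_s` values, and the exception is re-raised so that a scheduler still sees the failure.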
Collaboration
- Worked alongside the AML team and stakeholders to make sure data semantics matched regulatory requirements.
- Produced technical documentation covering processes, architecture, and configuration.
Outcome
The result was a set of repeatable, monitored pipelines that AML analysts could trust and that operations could troubleshoot without paging the original author.