
# Real-Time Financial ETL Pipeline

**Databricks | PySpark | Spark SQL | Delta Lake | GCP**
## 📌 Business Problem

Financial transaction data was stored in raw format and was not optimized for analytics or reporting. The goal was to build a scalable ETL pipeline that transforms raw transactional data into analytics-ready datasets using modern data engineering best practices.
## 🛠 Tools & Technologies

- Databricks
- PySpark
- Spark SQL
- Delta Lake
- Google Cloud Platform (GCP)
- Medallion Architecture (Bronze/Silver/Gold)
## 🔄 Solution & Architecture

I designed and implemented a Medallion Architecture pipeline:

### Bronze Layer

- Ingested raw financial transaction data
- Applied schema enforcement and basic validation
- Stored immutable raw data for traceability
### Silver Layer

- Cleaned and standardized data
- Applied deduplication and null handling
- Performed data type casting and validation checks
- Built structured tables optimized for transformations
### Gold Layer

- Created curated tables for reporting and analytics
- Aggregated transaction metrics
- Optimized datasets for BI tool consumption
## 📊 Key Outcomes

- Built a scalable ETL pipeline following industry-standard data architecture
- Reduced transformation complexity for downstream analytics
- Delivered analytics-ready, structured datasets for reporting
- Demonstrated an end-to-end data engineering workflow from ingestion to consumption
## I/O Architecture Diagram
Pipeline flow:

1. A synthetic transaction generator writes JSON batches to GCS (Raw).
2. Databricks Auto Loader ingests them into Bronze (streaming + checkpointing).
3. Silver cleans and enriches the data (date, hour, weekday, high-value flag).
4. Gold builds business KPIs (daily/hourly/category aggregates, top accounts, high-value table).
5. A Databricks SQL Warehouse serves the curated Gold tables to Power BI.
## Data Modeling (Power BI)

- Star schema with a shared DateTable dimension and a Merchant Category dimension
- Measures include:
  - Total Transactions
  - Total Spend
  - High Value Transaction Count
  - DoD % Change (time intelligence)