
# Real-Time Financial ETL Pipeline

**Databricks | PySpark | Spark SQL | Delta Lake | GCP**
## 📌 Business Problem

Financial transaction data was stored in raw format and was not optimized for analytics or reporting. The goal was to build a scalable ETL pipeline that transforms raw transactional data into analytics-ready datasets using modern data engineering best practices.
## 🛠 Tools & Technologies

- Databricks
- PySpark
- Spark SQL
- Delta Lake
- Google Cloud Platform (GCP)
- Medallion Architecture (Bronze/Silver/Gold)
## 🔄 Solution & Architecture

I designed and implemented a Medallion Architecture pipeline:

### Bronze Layer

- Ingested raw financial transaction data
- Applied schema enforcement and basic validation
- Stored immutable raw data for traceability
### Silver Layer

- Cleaned and standardized data
- Applied deduplication and null handling
- Performed data type casting and validation checks
- Built structured tables optimized for transformations
### Gold Layer

- Created curated tables for reporting and analytics
- Aggregated transaction metrics
- Optimized datasets for BI tool consumption
## 📊 Key Outcomes

- Built a scalable ETL pipeline following industry-standard data architecture
- Reduced transformation complexity for downstream analytics
- Delivered analytics-ready, structured datasets for reporting
- Demonstrated an end-to-end data engineering workflow from ingestion to consumption
## I/O Architecture Diagram
Pipeline flow:

1. A synthetic transaction generator writes JSON batches to GCS (Raw).
2. Databricks Auto Loader ingests them into Bronze (streaming + checkpointing).
3. Silver cleans and enriches the data (date, hour, weekday, high-value flag).
4. Gold builds business KPIs (daily/hourly/category aggregates, top accounts, high-value table).
5. A Databricks SQL Warehouse serves the curated Gold tables to Power BI.
## Data Modeling (Power BI)

- Star schema with a shared DateTable dimension and a Merchant Category dimension
- Measures include:
  - Total Transactions
  - Total Spend
  - High Value Transaction Count
  - DoD % Change (time intelligence)