This project demonstrates real-time financial transaction processing and monitoring using Apache Kafka, Apache Spark, and ELK Stack. It is designed to simulate high-velocity financial transactions, process them in real-time, and provide monitoring and logging capabilities.
Note: This is a personal project running on limited system resources and is not deployed in a cloud environment. Despite these constraints, it effectively showcases a scalable, production-style architecture.
- Generates financial transactions (credit card, PayPal, bank transfers).
- Sends transactions to Kafka at a high velocity.
- Processes them in real-time using Spark Streaming to:
  - Aggregate transaction amounts per merchant.
  - Detect anomalies such as potential fraud.
- Stores and monitors data using Prometheus, Grafana, and ELK Stack.
- Banks, fintech companies, and stock markets handle millions of transactions per second.
- Fraud detection in real-time is critical to prevent financial losses.
- Monitoring system health ensures stability and prevents transaction failures.
Component | Technology Used | Purpose |
---|---|---|
Data Producer | Python and Java (Kafka Producer API) | Generates high-speed financial transactions. |
Message Broker | Apache Kafka (Docker) | Stores and distributes transactions across brokers. |
Stream Processing | Apache Spark (Python) | Aggregates transactions and detects fraud. |
Monitoring | Prometheus and Grafana | Tracks Kafka, Spark, and system performance. |
Logging | ELK Stack (Elasticsearch, Logstash, Kibana) | Stores logs for debugging and insights. |
- Generates synthetic transactions (approximately 1.2M+ per hour).
- Uses the Kafka Producer API to publish messages to the `financial_transactions` topic.
- Runs in parallel producer threads (`producer_data_in_parallel(3)`); see the sketch below.
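A minimal sketch of the Python producer is shown below, assuming the `kafka-python` package and a broker at `localhost:9092` (both assumptions); only the `producer_data_in_parallel` entry point, the topic name, and the schema fields come from this project.

```python
# Minimal sketch of the Python producer, assuming kafka-python and a local broker.
# Field values are illustrative; the schema fields mirror the table in the next section.
import json
import random
import time
import uuid
from threading import Thread

from kafka import KafkaProducer

TOPIC = "financial_transactions"

def generate_transaction() -> dict:
    """Build one synthetic transaction matching the schema table below."""
    return {
        "transactionId": str(uuid.uuid4()),
        "userId": f"user_{random.randint(1, 100000)}",
        "amount": round(random.uniform(1.0, 5000.0), 2),
        "transactionTime": int(time.time()),
        "merchantId": f"merchant_{random.randint(1, 500)}",
        "transactionType": random.choice(["purchase", "refund"]),
        "location": f"location_{random.randint(1, 50)}",
        "paymentMethod": random.choice(["credit_card", "paypal", "bank_transfer"]),
        "isInternational": random.random() < 0.1,
        "currency": random.choice(["USD", "EUR", "GBP"]),
    }

def produce_forever(producer: KafkaProducer) -> None:
    """Publish transactions in a tight loop; batching is handled by the producer."""
    while True:
        producer.send(TOPIC, value=generate_transaction())

def producer_data_in_parallel(num_threads: int) -> None:
    """Run several producer threads against one shared (thread-safe) producer."""
    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",  # assumption: local broker
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    threads = [Thread(target=produce_forever, args=(producer,)) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

if __name__ == "__main__":
    producer_data_in_parallel(3)
```

kafka-python's `KafkaProducer` is thread-safe, so the threads share one instance and its internal batching keeps throughput high without extra tuning.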
- Generates synthetic transactions (approximately 150k per second).
- Alternative producer written in Java for high throughput.
- Uses `ExecutorService` for concurrent transaction publishing.
Column | Data Type | Size (bytes) | Description |
---|---|---|---|
transactionId | STRING | 36 | Unique identifier for each transaction. |
userId | STRING | 12 | Represents the user making the transaction. |
amount | DOUBLE | 8 | Transaction amount (randomized). |
transactionTime | LONG | 8 | UNIX timestamp of transaction. |
merchantId | STRING | 12 | Merchant receiving payment. |
transactionType | STRING | 8 | "purchase" or "refund". |
location | STRING | 12 | Location of transaction. |
paymentMethod | STRING | 15 | "credit_card", "paypal", "bank_transfer". |
isInternational | BOOLEAN | 5 | Whether the transaction is international. |
currency | STRING | 5 | Currency code (USD, EUR, GBP, etc.). |
Total: approximately 120 bytes per transaction.
- If the Java producer generates 1.2 billion records per hour, the total volume is approximately 144 GB per hour (1.2 billion × 120 bytes).
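The estimate above is simple arithmetic on the record size from the schema table:

```python
# Back-of-the-envelope sizing for the 1.2 billion records/hour scenario,
# using the ~120-byte record size from the schema table above.
records_per_hour = 1_200_000_000
bytes_per_record = 120
total_gb = records_per_hour * bytes_per_record / 1e9
print(f"~{total_gb:.0f} GB per hour")  # ~144 GB, before Kafka replication overhead
```

With the replication factor of 3 used for the topics below, on-disk usage across brokers would be roughly three times that figure.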
- Ensures fault tolerance with replication.
- Partitions data for parallel processing.
Topic Name | Partitions | Replication Factor | Retention |
---|---|---|---|
financial_transactions | 5 | 3 | 7 days |
transaction_aggregates | 3 | 3 | 7 days |
transaction_anomalies | 3 | 3 | 7 days |
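Below is a minimal sketch of creating these topics with the settings from the table, assuming the `kafka-python` admin client and a local broker; the actual project may create them via the Kafka CLI or Docker init scripts instead.

```python
# Sketch: create the three topics with the partitions, replication, and 7-day
# retention listed above. Broker address is an assumption.
from kafka.admin import KafkaAdminClient, NewTopic

RETENTION_7_DAYS_MS = str(7 * 24 * 60 * 60 * 1000)

admin = KafkaAdminClient(bootstrap_servers="localhost:9092")
admin.create_topics([
    NewTopic("financial_transactions", num_partitions=5, replication_factor=3,
             topic_configs={"retention.ms": RETENTION_7_DAYS_MS}),
    NewTopic("transaction_aggregates", num_partitions=3, replication_factor=3,
             topic_configs={"retention.ms": RETENTION_7_DAYS_MS}),
    NewTopic("transaction_anomalies", num_partitions=3, replication_factor=3,
             topic_configs={"retention.ms": RETENTION_7_DAYS_MS}),
])
```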
- Reads transaction data from Kafka in real-time.
- Parses raw JSON messages into structured data.
- Performs:
  - Aggregation → Computes total transaction volume per merchant.
  - Anomaly Detection → Flags high-frequency transactions as potential fraud.
- Writes processed data back to Kafka (see the sketch after the table below).
Stream Process | Input Topic | Output Topic | Checkpoint Directory |
---|---|---|---|
Transaction Aggregation | financial_transactions | transaction_aggregates | /mnt/spark-checkpoints/aggregates |
Anomaly Detection | financial_transactions | transaction_anomalies | /mnt/spark-checkpoints/anomalies |
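The sketch below shows what the aggregation stream could look like with PySpark Structured Streaming, using the topics and checkpoint directories from the table above; the broker address and exact options are assumptions, and the job needs the `spark-sql-kafka` connector package on its classpath.

```python
# Sketch of the Transaction Aggregation stream, assuming PySpark Structured Streaming
# with the Kafka connector. Topic names and checkpoint paths come from the table above.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StringType, DoubleType, LongType, BooleanType

spark = SparkSession.builder.appName("TransactionAggregation").getOrCreate()

schema = (StructType()
          .add("transactionId", StringType())
          .add("userId", StringType())
          .add("amount", DoubleType())
          .add("transactionTime", LongType())
          .add("merchantId", StringType())
          .add("transactionType", StringType())
          .add("location", StringType())
          .add("paymentMethod", StringType())
          .add("isInternational", BooleanType())
          .add("currency", StringType()))

# Read raw JSON messages from the financial_transactions topic.
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "localhost:9092")  # assumption: local broker
       .option("subscribe", "financial_transactions")
       .load())

transactions = (raw.selectExpr("CAST(value AS STRING) AS json")
                .select(F.from_json("json", schema).alias("t"))
                .select("t.*"))

# Aggregate total transaction volume per merchant.
aggregates = (transactions
              .groupBy("merchantId")
              .agg(F.sum("amount").alias("totalAmount"),
                   F.count("transactionId").alias("txCount")))

# Write aggregates back to Kafka as JSON, checkpointing to the directory above.
query = (aggregates
         .select(F.col("merchantId").alias("key"),
                 F.to_json(F.struct("merchantId", "totalAmount", "txCount")).alias("value"))
         .writeStream
         .format("kafka")
         .option("kafka.bootstrap.servers", "localhost:9092")
         .option("topic", "transaction_aggregates")
         .option("checkpointLocation", "/mnt/spark-checkpoints/aggregates")
         .outputMode("update")
         .start())

query.awaitTermination()
```

The anomaly-detection stream follows the same read/parse/write pattern, but applies a frequency-based rule per user or merchant and writes to `transaction_anomalies` with the `/mnt/spark-checkpoints/anomalies` checkpoint directory.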
- Collects metrics from Kafka and Spark.
- Tracks:
  - Kafka broker health and lag.
  - Number of transactions processed per second (see the sketch below).
  - Spark job execution metrics.
- Displays real-time dashboards for monitoring.
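As a hypothetical illustration of the "transactions processed per second" metric, a custom counter could be exposed with the `prometheus_client` library; in this project Kafka and Spark metrics are more likely scraped through exporter agents, so treat this as a sketch rather than the actual setup.

```python
# Sketch: expose a custom throughput counter for Prometheus to scrape (assumed setup).
# In Grafana, rate(transactions_processed_total[1m]) then shows transactions per second.
import time

from prometheus_client import Counter, start_http_server

TRANSACTIONS_PROCESSED = Counter(
    "transactions_processed_total",
    "Total number of financial transactions processed",
)

def record_transaction_processed() -> None:
    TRANSACTIONS_PROCESSED.inc()

if __name__ == "__main__":
    start_http_server(8000)  # assumption: metrics served on :8000/metrics
    while True:
        record_transaction_processed()  # called from the real processing loop in practice
        time.sleep(0.01)
```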
- Elasticsearch: Stores logs from Kafka and Spark.
- Logstash: Collects, filters, and processes logs.
- Kibana: Provides a searchable log dashboard.
Logs collected:
- Kafka transaction logs.
- Spark processing logs.
- System performance logs.
Since this is a personal project, certain limitations exist:
- Limited hardware resources → Running multiple Kafka brokers and Spark nodes on a single machine restricts performance.
- No cloud deployment → A production system would typically use AWS, Azure, or GCP.
Even with those limitations, the project demonstrates:
- Real-time transaction processing with Kafka and Spark.
- High-speed message handling with Kafka brokers.
- System monitoring via Prometheus and Grafana.
- Log analysis with the ELK Stack.
Despite hardware constraints, this project aims to demonstrate a production-style architecture for high-performance financial transaction processing.
This project draws on the concepts and techniques demonstrated in the following resource:
- YouTube Tutorial: 1.2 Billion Records Per Hour High Performance Kafka and Spark - End to End Data Engineering Project
This implementation is a customized version based on what I learned from that tutorial, adapted to my own system constraints and architecture.