- Maitrey Talware
- Handle real-time data - (Kafka)
- Perform machine learning on the fly on huge amounts of data - (SparkML)
- Store large amounts of data - (Cassandra/Elastic)
- Visualize in real time - (Kibana)
Credit cards have displaced many conventional methods of payment. According to a report by FICO®, 86% of millennials use credit cards for payments. The same report shows that in the US alone, unauthorized credit card transactions caused $11 billion in damage in 2017. Financial data is generally huge, and we need to extract meaning from it in order to detect fraud. Hence, this project combines big data with machine learning in a real-time application: the transaction source feeds live data, which we then channel into the respective modules.
With credit now accepted as a source of funds across all sections of society, fraud detection becomes critical. Fraudsters try to illicitly gather users' credit card information using sophisticated techniques. To detect fraudulent transactions, we use big data coupled with machine learning. The ever-increasing demand for fraud detection on valuable credit card transactions needs to be automated, taking into account features specific to each user's spending interests. The fraud detection system detects unusual transactions and filters out the suspected ones, and our machine learning model makes it a self-learning system as well.
Apache Kafka is an open-source stream-processing software platform developed by LinkedIn and donated to the Apache Software Foundation, written in Scala and Java. The project aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds.
Apache Spark is an open-source distributed general-purpose cluster-computing framework. Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.
Spark.ml is a new package introduced in Spark 1.2, which aims to provide a uniform set of high-level APIs that help users create and tune practical machine learning pipelines.
Apache Cassandra is a free and open-source, distributed, wide-column store, NoSQL database management system designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure. Cassandra offers robust support for clusters spanning multiple data centers, with asynchronous masterless replication allowing low-latency operations for all clients.
Elasticsearch is a search engine based on the Lucene library. It provides a distributed, multitenant-capable full-text search engine with an HTTP web interface and schema-free JSON documents.
Kibana is an open source data visualization plugin for Elasticsearch. It provides visualization capabilities on top of the content indexed on an Elasticsearch cluster. Users can create bar, line and scatter plots, or pie charts and maps on top of large volumes of data.
We have built our own data simulator, which produces up to 80 transactions per second. Our architecture is not limited to this data, however. We simulate the data to show how the system would work in a real-world scenario, since obtaining real transactions is out of scope.
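The simulator described above could look roughly like the following. This is a minimal sketch, not the project's actual simulator: the field names (`transaction_id`, `card_id`, `amount`, `timestamp`), the value ranges, and the helper `to_message` are assumptions made for illustration. Only the cap of 80 transactions per second comes from the text.

```python
import json
import random
import time

MAX_TPS = 80  # upper bound on transactions per second, as stated in the text


def simulate_transactions(tps=MAX_TPS, seed=None, sleep=time.sleep):
    """Yield synthetic transactions indefinitely, at most `tps` per second.

    The record layout is hypothetical; the real simulator's schema
    is not shown in this README.
    """
    if not 0 < tps <= MAX_TPS:
        raise ValueError(f"tps must be in the range (0, {MAX_TPS}]")
    rng = random.Random(seed)
    interval = 1.0 / tps  # pause between records to respect the rate cap
    transaction_id = 0
    while True:
        transaction_id += 1
        yield {
            "transaction_id": transaction_id,
            "card_id": rng.randint(1000, 9999),           # assumed field
            "amount": round(rng.uniform(1.0, 500.0), 2),  # assumed range
            "timestamp": time.time(),
        }
        sleep(interval)


def to_message(transaction):
    """Serialize a transaction to UTF-8 JSON bytes, a common message payload."""
    return json.dumps(transaction).encode("utf-8")
```

Each yielded dictionary can be passed through `to_message` and handed to whichever Kafka producer client the project uses. The `sleep` parameter is injectable so the generator can be exercised without real delays.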
Here are a few good articles I found on installing Kafka:
https://medium.com/@Ankitthakur/apache-kafka-installation-on-mac-using-homebrew-a367cdefd273
https://dzone.com/articles/running-apache-kafka-on-windows-os
From here on, I give the commands for Mac. The Windows commands can be found in the links above; they are almost the same, except that on Windows you run the .bat files.
zookeeper-server-start /usr/local/etc/kafka/zookeeper.properties
kafka-server-start /usr/local/etc/kafka/server.properties
Here are a few good articles I found on installing Cassandra:
From here on, I give the commands for Mac. The Windows commands can be found in the links above; they are almost the same, except that on Windows you run the .bat files.
sudo cassandra -f
Download and unzip Elasticsearch from - https://www.elastic.co/downloads/elasticsearch
bin/elasticsearch
bin\elasticsearch.bat
Download and unzip Kibana from - https://www.elastic.co/downloads/kibana
bin/kibana
bin\kibana.bat
Credit card transactions are one of the emerging preferred payment methods worldwide, and we need fraud detection that can process, protect and visualize credit card transaction data. Our system accurately classifies transactions based on the data generated by the producer system. The input data has been made robust so that data loss is reduced. After real-time processing and analysis, the data is stored in a distributed fashion and retrieved in real time using index-based search. The processing and classification needed an appropriate visualization method to give an overview of system performance and to help users understand the results correctly. Kibana helped achieve this through its visualizations.
The existing architecture stack can accommodate more producers and consumers.
By simply changing the IP addresses of Elasticsearch, Kibana and Cassandra, we can run it in a cluster environment.
Predictions and the ML model can be made more robust by adding more advanced features to the data.
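One way to make the IP-address change mentioned above painless is to read the service addresses from the environment instead of hard-coding them. The sketch below is an assumption about how this could be done, not how the project currently does it: the environment variable names and the `load_hosts` helper are hypothetical, and the defaults are simply the standard local ports of each service.

```python
import os

# Hypothetical variable names; defaults are the services' standard local ports.
DEFAULT_HOSTS = {
    "ELASTICSEARCH_HOST": "localhost:9200",
    "KIBANA_HOST": "localhost:5601",
    "CASSANDRA_HOST": "localhost:9042",
}


def load_hosts(env=None):
    """Return service addresses, letting environment values override the defaults."""
    env = os.environ if env is None else env
    return {name: env.get(name, default) for name, default in DEFAULT_HOSTS.items()}
```

With this in place, moving from a single machine to a cluster would only require setting, for example, `CASSANDRA_HOST` to the address of a cluster node before starting the application.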

