
Real time data handling with Kafka

Machine learning in production, with real-time visualization

                                             - Maitrey Talware

The main aim of this project is to build a scalable architecture with the capability to:

  1. Handle real-time data - (Kafka)
  2. Perform machine learning on the fly on huge amounts of data - (SparkML)
  3. Store large amounts of data - (Cassandra/Elastic)
  4. Visualize in real time - (Kibana)

Table of contents



1. Introduction and Objective

Conventional methods of payment have been largely forgotten since the emergence of credit cards. According to a report by FICO®, 86% of millennials use credit cards for payments. The same report shows that in the US alone, unauthorized credit card transactions caused $11 billion in damage in 2017. Financial data is generally huge, and we need to extract meaning from it in order to detect and prevent fraud. Hence, this project demonstrates the conjunction of big data with machine learning in a real-time application: the transaction source feeds live data, which we then channel into the respective modules.

With the world accepting credit as a source of funds across all sections of society, fraud detection becomes critical. Fraudsters try to illicitly gather users' credit card information using sophisticated techniques. To detect fraudulent transactions, we use big data coupled with machine learning. There is therefore a need to automate the ever-increasing demand for fraud detection on valuable credit card transactions, tailored to the particular features of each user's spending interests. The fraud detection system detects unusual transactions and filters out the suspected ones, and our machine learning model makes it a self-learning system as well.


1.1 Architecture Diagram


1.2 Components of Project

1. Apache Kafka

Apache Kafka is an open-source stream-processing software platform developed by LinkedIn and donated to the Apache Software Foundation, written in Scala and Java. The project aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds.

2. Apache Spark

Apache Spark is an open-source distributed general-purpose cluster-computing framework. Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.

3. Apache Spark ML

Spark.ml is a new package introduced in Spark 1.2, which aims to provide a uniform set of high-level APIs that help users create and tune practical machine learning pipelines.
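For this project, such a pipeline would chain feature preparation and a classifier over the transaction stream. Below is a minimal sketch of what that could look like; the column names (`merchant_category`, `amount`, `is_fraud`) and the choice of logistic regression are illustrative assumptions, not necessarily what this repo uses, and pyspark is imported lazily so the helpers can be read without a Spark install:

```python
def fraud_pipeline_stages():
    # The stages chained below, in order (pure helper, no Spark needed)
    return ["category_indexer", "feature_assembler", "logistic_regression"]

def build_fraud_pipeline():
    """Assemble a spark.ml Pipeline for fraud classification.

    Column names are illustrative; swap in the real transaction schema.
    """
    # Imported here so this module loads even without pyspark installed.
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import StringIndexer, VectorAssembler
    from pyspark.ml.classification import LogisticRegression

    # Turn the categorical merchant field into a numeric index
    indexer = StringIndexer(inputCol="merchant_category", outputCol="category_idx")
    # Combine numeric columns into the single feature vector spark.ml expects
    assembler = VectorAssembler(inputCols=["amount", "category_idx"], outputCol="features")
    # Binary classifier over the assembled features
    lr = LogisticRegression(featuresCol="features", labelCol="is_fraud")
    return Pipeline(stages=[indexer, assembler, lr])

# Usage (requires a running SparkSession and DataFrames of transactions):
#   model = build_fraud_pipeline().fit(train_df)
#   scored = model.transform(stream_df)
```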

4. Apache Cassandra

Apache Cassandra is a free and open-source, distributed, wide-column NoSQL database management system designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure. Cassandra offers robust support for clusters spanning multiple data centers, with asynchronous masterless replication allowing low-latency operations for all clients.

5. Elasticsearch

Elasticsearch is a search engine based on the Lucene library. It provides a distributed, multitenant-capable full-text search engine with an HTTP web interface and schema-free JSON documents.

6. Kibana

Kibana is an open source data visualization plugin for Elasticsearch. It provides visualization capabilities on top of the content indexed on an Elasticsearch cluster. Users can create bar, line and scatter plots, or pie charts and maps on top of large volumes of data.


2. Data

2.1 Creating the data

We have built our own data simulator, which produces up to 80 transactions per second. Our architecture is not limited to this data source; we simulate the data to show how the system would work in a real-world scenario, since obtaining real transaction data is out of scope.
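A simulator along these lines can be written with the standard library alone. The sketch below generates synthetic transactions as dicts ready for JSON serialization; the field names are illustrative assumptions, since the actual schema is the one shown in section 2.2:

```python
import json
import random
import uuid
from datetime import datetime, timezone

# Illustrative merchant categories; the real simulator defines its own.
MERCHANTS = ["grocery", "travel", "electronics", "fuel", "restaurant"]

def make_transaction():
    """Generate one synthetic credit-card transaction as a dict."""
    return {
        "transaction_id": str(uuid.uuid4()),
        "card_number": "XXXX-XXXX-XXXX-%04d" % random.randint(0, 9999),
        "amount": round(random.uniform(1.0, 2000.0), 2),
        "merchant_category": random.choice(MERCHANTS),
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }

def transaction_stream(n):
    """Yield n transactions, e.g. to be published to a Kafka topic."""
    for _ in range(n):
        yield make_transaction()

# Example: serialize a small batch the way a producer would
batch = [json.dumps(t) for t in transaction_stream(5)]
```

Calling `transaction_stream` in a loop with a short sleep approximates the ~80 transactions/second rate mentioned above.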


2.2 Structure of the data


3. Kafka

3.1 Installing Kafka

Here are a few good articles that I found on the internet for installing Kafka:

For Mac

https://medium.com/@Ankitthakur/apache-kafka-installation-on-mac-using-homebrew-a367cdefd273

For Windows

https://dzone.com/articles/running-apache-kafka-on-windows-os

NOTE

From here on I will be giving the commands for Mac; the commands for Windows can be found in the links above. They are almost identical; on Windows you just need to run the .bat files instead.


3.2 Starting the zookeeper Server

zookeeper-server-start /usr/local/etc/kafka/zookeeper.properties

3.3 Starting the Kafka Server

kafka-server-start /usr/local/etc/kafka/server.properties
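With ZooKeeper and the broker both running, the simulator can publish transactions to a topic. A minimal sketch using the kafka-python client (an assumption; the repo may use a different client, and the topic name `transactions` is illustrative):

```python
import json

TOPIC = "transactions"  # hypothetical topic name

def encode(txn):
    """Serialize a transaction dict to the bytes Kafka expects on the wire."""
    return json.dumps(txn, sort_keys=True).encode("utf-8")

def publish(txn, bootstrap="localhost:9092"):
    """Send one transaction to the local broker started above."""
    # kafka-python imported lazily so encode() works without it installed.
    from kafka import KafkaProducer
    producer = KafkaProducer(bootstrap_servers=bootstrap, value_serializer=encode)
    producer.send(TOPIC, txn)
    producer.flush()  # block until the broker acknowledges
```

A Spark consumer on the other side would subscribe to the same topic and decode with `json.loads`.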

4. Cassandra

4.1 Installing Cassandra

Here are a few good articles that I found on the internet for installing Cassandra:

For Mac

https://medium.com/@areeves9/cassandras-gossip-on-os-x-single-node-installation-of-apache-cassandra-on-mac-634e6729fad6

For Windows



4.2 Starting Cassandra Server

sudo cassandra -f

4.3 Putting data into Cassandra
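
Once the node is up, classified transactions can be written with the DataStax Python driver. A hedged sketch (keyspace, table, and column names are illustrative assumptions; the keyspace and table are assumed to have been created beforehand via cqlsh):

```python
def insert_cql(keyspace="fraud", table="transactions"):
    """Build the parameterized INSERT used below (names are illustrative)."""
    return (
        f"INSERT INTO {keyspace}.{table} "
        "(transaction_id, amount, merchant_category, ts) "
        "VALUES (%s, %s, %s, %s)"
    )

def store(txn, hosts=("127.0.0.1",)):
    """Write one transaction to the local Cassandra node started above."""
    # cassandra-driver imported lazily so insert_cql() works without it.
    from cassandra.cluster import Cluster
    cluster = Cluster(list(hosts))
    session = cluster.connect()
    session.execute(
        insert_cql(),
        (txn["transaction_id"], txn["amount"], txn["merchant_category"], txn["timestamp"]),
    )
    cluster.shutdown()
```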


5. ElasticSearch

5.1 Installing ElasticSearch

Download and unzip elasticsearch from - https://www.elastic.co/downloads/elasticsearch

5.2 Starting the Elastic Server

Open Terminal and cd to unzipped elasticsearch folder

For Mac

bin/elasticsearch

For Windows

bin\elasticsearch.bat
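
With a node listening on localhost:9200, transactions can be indexed for Kibana. A sketch using the official Python client (assumed here; the index name `transactions` is illustrative, and the `es.index(document=...)` keyword matches the 8.x client):

```python
from datetime import datetime, timezone

INDEX = "transactions"  # hypothetical index name

def to_document(txn):
    """Shape a transaction for indexing; Kibana needs a date field to chart."""
    doc = dict(txn)
    doc.setdefault("@timestamp", datetime.now(timezone.utc).isoformat())
    return doc

def index_transaction(txn, host="http://localhost:9200"):
    """Index one transaction into the local node started above."""
    # elasticsearch client imported lazily so to_document() works without it.
    from elasticsearch import Elasticsearch
    es = Elasticsearch(host)
    es.index(index=INDEX, document=to_document(txn))
```

The `@timestamp` field is what a Kibana index pattern would typically use as its time filter.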

6. Spark

6.1 Installing Spark


7. Kibana

7.1 Installing Kibana

Download and unzip Kibana from - https://www.elastic.co/downloads/kibana

7.2 Starting the Kibana Server

Open Terminal and cd to unzipped kibana folder

For Mac

bin/kibana 

For Windows

bin\kibana.bat

8. CONCLUSION AND FUTURE SCOPE

8.1 Conclusion

Credit card transactions are one of the emerging preferred payment methods across the world, and we need to build fraud detection that can process, protect, and visualize credit card transaction data. Our system has managed to accurately classify transactions based on the data generated from the producer system. The input pipeline has been made robust so that it reduces any kind of data loss. Building on real-time processing and analysis technologies, data is stored in a distributed fashion and retrieved in real time using index-based search. The processing and classification results needed an appropriate visualization method to give an overview of system performance and support correct user understanding; Kibana has helped achieve that understanding through its visualizations.


8.2 Future Scope

Maintained Scalability

The existing architecture stack can accommodate more producers and consumers.

Code Compatible for running on clusters

By simply changing the IP addresses of Elasticsearch, Kibana, and Cassandra, we can run the system in a cluster environment.
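One way to keep those addresses out of the code is to read them from the environment, defaulting to localhost for local runs. A sketch (the variable names are illustrative, not ones the repo necessarily defines):

```python
import os

def service_endpoints():
    """Read service addresses from the environment, defaulting to localhost.

    Point these variables at the cluster nodes to run the same code
    unchanged in a cluster environment.
    """
    return {
        "kafka": os.environ.get("KAFKA_BOOTSTRAP", "localhost:9092"),
        "cassandra": os.environ.get("CASSANDRA_HOST", "127.0.0.1"),
        "elasticsearch": os.environ.get("ELASTIC_URL", "http://localhost:9200"),
        "kibana": os.environ.get("KIBANA_URL", "http://localhost:5601"),
    }
```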

More Data Accommodation

Prediction and the machine learning model can be made more robust by adding more advanced features to the data.


