
Real time data handling with Kafka

Machine learning in production, with real-time visualization

                                             - Maitrey Talware

The main aim of this project is to build a scalable architecture with the capability to:

  1. Handle real-time data - (Kafka)
  2. Perform machine learning on the fly on huge amounts of data - (SparkML)
  3. Store large amounts of data - (Cassandra/Elastic)
  4. Visualize in real time - (Kibana)

Table of contents



1. Introduction and Objective

Conventional methods of payment have been largely forgotten since the emergence of credit cards. According to a report by FICO®, 86% of millennials use credit cards for payments. The same report shows that in the US alone, unauthorized credit card transactions caused $11 billion in damage in 2017. Financial data is generally huge, and we need to extract meaning from it in order to detect and prevent fraud. Hence, this project demonstrates the conjunction of big data with machine learning in a real-time application: the transaction source feeds live data, which we then channel into the respective modules.

With the world accepting credit as a source of funds across all sections of society, fraud detection becomes critical. Fraudsters try to illicitly gather users' credit card information using sophisticated techniques. To detect fraudulent transactions, we use big data coupled with machine learning. There is therefore a need to automate the ever-increasing demand for fraud detection on valuable credit card transactions, tailored to the particular features of each user's spending interests. The fraud detection system detects unusual transactions and filters out the suspected ones, and our machine learning model makes it a self-learning system as well.


1.1 Architecture Diagram


1.2 Components of Project

1. Apache Kafka

Apache Kafka is an open-source stream-processing software platform developed by LinkedIn and donated to the Apache Software Foundation, written in Scala and Java. The project aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds.

2. Apache Spark

Apache Spark is an open-source distributed general-purpose cluster-computing framework. Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.

3. Apache Spark ML

Spark.ml is a new package introduced in Spark 1.2, which aims to provide a uniform set of high-level APIs that help users create and tune practical machine learning pipelines.
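For this project, such a pipeline would chain feature preparation and a classifier over the transaction stream. Below is a minimal sketch of what that could look like; the column names (`merchant_category`, `amount`, `is_fraud`) and the choice of logistic regression are illustrative assumptions, not necessarily what this repo uses, and pyspark is imported lazily so the helpers can be read without a Spark install:

```python
def fraud_pipeline_stages():
    # The stages chained below, in order (pure helper, no Spark needed)
    return ["category_indexer", "feature_assembler", "logistic_regression"]

def build_fraud_pipeline():
    """Assemble a spark.ml Pipeline for fraud classification.

    Column names are illustrative; swap in the real transaction schema.
    """
    # Imported here so this module loads even without pyspark installed.
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import StringIndexer, VectorAssembler
    from pyspark.ml.classification import LogisticRegression

    # Turn the categorical merchant field into a numeric index
    indexer = StringIndexer(inputCol="merchant_category", outputCol="category_idx")
    # Combine numeric columns into the single feature vector spark.ml expects
    assembler = VectorAssembler(inputCols=["amount", "category_idx"], outputCol="features")
    # Binary classifier over the assembled features
    lr = LogisticRegression(featuresCol="features", labelCol="is_fraud")
    return Pipeline(stages=[indexer, assembler, lr])

# Usage (requires a running SparkSession and DataFrames of transactions):
#   model = build_fraud_pipeline().fit(train_df)
#   scored = model.transform(stream_df)
```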

4. Apache Cassandra

Apache Cassandra is a free and open-source, distributed, wide-column NoSQL database management system designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure. Cassandra offers robust support for clusters spanning multiple data centers, with asynchronous masterless replication allowing low-latency operations for all clients.

5. Elasticsearch

Elasticsearch is a search engine based on the Lucene library. It provides a distributed, multitenant-capable full-text search engine with an HTTP web interface and schema-free JSON documents.

6. Kibana

Kibana is an open source data visualization plugin for Elasticsearch. It provides visualization capabilities on top of the content indexed on an Elasticsearch cluster. Users can create bar, line and scatter plots, or pie charts and maps on top of large volumes of data.


2. Data

2.1 Creating the data

We have built our own data simulator, which produces up to 80 transactions per second. Our architecture is not limited to this data source; we simulate the data to show how the system would work in a real-world scenario, since obtaining real transaction data is out of scope.
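A simulator along these lines can be written with the standard library alone. The sketch below generates synthetic transactions as dicts ready for JSON serialization; the field names are illustrative assumptions, since the actual schema is the one shown in section 2.2:

```python
import json
import random
import uuid
from datetime import datetime, timezone

# Illustrative merchant categories; the real simulator defines its own.
MERCHANTS = ["grocery", "travel", "electronics", "fuel", "restaurant"]

def make_transaction():
    """Generate one synthetic credit-card transaction as a dict."""
    return {
        "transaction_id": str(uuid.uuid4()),
        "card_number": "XXXX-XXXX-XXXX-%04d" % random.randint(0, 9999),
        "amount": round(random.uniform(1.0, 2000.0), 2),
        "merchant_category": random.choice(MERCHANTS),
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }

def transaction_stream(n):
    """Yield n transactions, e.g. to be published to a Kafka topic."""
    for _ in range(n):
        yield make_transaction()

# Example: serialize a small batch the way a producer would
batch = [json.dumps(t) for t in transaction_stream(5)]
```

Calling `transaction_stream` in a loop with a short sleep approximates the ~80 transactions/second rate mentioned above.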


2.2 Structure of the data


3. Kafka

3.1 Installing Kafka

Here are a few good articles that I found on the internet for installing Kafka:

For Mac

https://medium.com/@Ankitthakur/apache-kafka-installation-on-mac-using-homebrew-a367cdefd273

For Windows

https://dzone.com/articles/running-apache-kafka-on-windows-os

NOTE

From here on I will be giving the commands for Mac; the commands for Windows can be found in the links above. They are almost identical; on Windows you just need to run the .bat files instead.


3.2 Starting the zookeeper Server

zookeeper-server-start /usr/local/etc/kafka/zookeeper.properties

3.3 Starting the Kafka Server

kafka-server-start /usr/local/etc/kafka/server.properties
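With ZooKeeper and the broker both running, the simulator can publish transactions to a topic. A minimal sketch using the kafka-python client (an assumption; the repo may use a different client, and the topic name `transactions` is illustrative):

```python
import json

TOPIC = "transactions"  # hypothetical topic name

def encode(txn):
    """Serialize a transaction dict to the bytes Kafka expects on the wire."""
    return json.dumps(txn, sort_keys=True).encode("utf-8")

def publish(txn, bootstrap="localhost:9092"):
    """Send one transaction to the local broker started above."""
    # kafka-python imported lazily so encode() works without it installed.
    from kafka import KafkaProducer
    producer = KafkaProducer(bootstrap_servers=bootstrap, value_serializer=encode)
    producer.send(TOPIC, txn)
    producer.flush()  # block until the broker acknowledges
```

A Spark consumer on the other side would subscribe to the same topic and decode with `json.loads`.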

4. Cassandra

4.1 Installing Cassandra

Here are a few good articles that I found on the internet for installing Cassandra:

For Mac

https://medium.com/@areeves9/cassandras-gossip-on-os-x-single-node-installation-of-apache-cassandra-on-mac-634e6729fad6

For Windows



4.2 Starting Cassandra Server

sudo cassandra -f

4.3 Putting data into Cassandra
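
Once the node is up, classified transactions can be written with the DataStax Python driver. A hedged sketch (keyspace, table, and column names are illustrative assumptions; the keyspace and table are assumed to have been created beforehand via cqlsh):

```python
def insert_cql(keyspace="fraud", table="transactions"):
    """Build the parameterized INSERT used below (names are illustrative)."""
    return (
        f"INSERT INTO {keyspace}.{table} "
        "(transaction_id, amount, merchant_category, ts) "
        "VALUES (%s, %s, %s, %s)"
    )

def store(txn, hosts=("127.0.0.1",)):
    """Write one transaction to the local Cassandra node started above."""
    # cassandra-driver imported lazily so insert_cql() works without it.
    from cassandra.cluster import Cluster
    cluster = Cluster(list(hosts))
    session = cluster.connect()
    session.execute(
        insert_cql(),
        (txn["transaction_id"], txn["amount"], txn["merchant_category"], txn["timestamp"]),
    )
    cluster.shutdown()
```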


5. ElasticSearch

5.1 Installing ElasticSearch

Download and unzip elasticsearch from - https://www.elastic.co/downloads/elasticsearch

5.2 Starting the Elastic Server

Open Terminal and cd to unzipped elasticsearch folder

For Mac

bin/elasticsearch

For Windows

bin\elasticsearch.bat
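
With a node listening on localhost:9200, transactions can be indexed for Kibana. A sketch using the official Python client (assumed here; the index name `transactions` is illustrative, and the `es.index(document=...)` keyword matches the 8.x client):

```python
from datetime import datetime, timezone

INDEX = "transactions"  # hypothetical index name

def to_document(txn):
    """Shape a transaction for indexing; Kibana needs a date field to chart."""
    doc = dict(txn)
    doc.setdefault("@timestamp", datetime.now(timezone.utc).isoformat())
    return doc

def index_transaction(txn, host="http://localhost:9200"):
    """Index one transaction into the local node started above."""
    # elasticsearch client imported lazily so to_document() works without it.
    from elasticsearch import Elasticsearch
    es = Elasticsearch(host)
    es.index(index=INDEX, document=to_document(txn))
```

The `@timestamp` field is what a Kibana index pattern would typically use as its time filter.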

6. Spark

6.1 Installing Spark


7. Kibana

7.1 Installing Kibana

Download and unzip Kibana from - https://www.elastic.co/downloads/kibana

7.2 Starting the Kibana Server

Open Terminal and cd to unzipped kibana folder

For Mac

bin/kibana 

For Windows

bin\kibana.bat

8. CONCLUSION AND FUTURE SCOPE

8.1 Conclusion

Credit card transactions are one of the emerging preferred payment methods across the world, and we need to build fraud detection that can process, protect, and visualize credit card transaction data. Our system has managed to accurately classify transactions based on the data generated from the producer system. The input pipeline has been made robust so that it reduces any kind of data loss. Building on real-time processing and analysis technologies, data is stored in a distributed fashion and retrieved in real time using index-based search. The processing and classification results needed an appropriate visualization method to give an overview of system performance and support correct user understanding; Kibana has helped achieve that understanding through its visualizations.


8.2 Future Scope

Maintained Scalability

The existing architecture stack can accommodate more producers and consumers.

Code Compatible for running on clusters

By simply changing the IP addresses of Elasticsearch, Kibana, and Cassandra, we can run the system in a cluster environment.
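One way to keep those addresses out of the code is to read them from the environment, defaulting to localhost for local runs. A sketch (the variable names are illustrative, not ones the repo necessarily defines):

```python
import os

def service_endpoints():
    """Read service addresses from the environment, defaulting to localhost.

    Point these variables at the cluster nodes to run the same code
    unchanged in a cluster environment.
    """
    return {
        "kafka": os.environ.get("KAFKA_BOOTSTRAP", "localhost:9092"),
        "cassandra": os.environ.get("CASSANDRA_HOST", "127.0.0.1"),
        "elasticsearch": os.environ.get("ELASTIC_URL", "http://localhost:9200"),
        "kibana": os.environ.get("KIBANA_URL", "http://localhost:5601"),
    }
```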

More Data Accommodation

Prediction and the machine learning model can be made more robust by adding more advanced features to the data.


