- NOUIH Omar
- AIT AHMED Aymen
- BEZZAR Nouhaila
This project develops a web application for real-time sentiment analysis of Twitter streams using Apache Kafka and machine learning models. The application categorizes each tweet as Negative, Positive, Neutral, or Irrelevant.
Video demonstration: `enregistrement-2024-04-29-005449_zmJIDUHN.mp4`
The repository is organized into two main directories:
- `APP`: Flask application, static resources, and HTML templates.
- `Spark`: machine learning and stream processing scripts, plus datasets.
To get Kafka running via Docker, execute the following commands:

```bash
cd path_to_project_directory
docker-compose up -d
```
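To confirm the broker is reachable before moving on, a quick check like the one below can help. It assumes `kafka-python` is installed, the compose file exposes a broker on `localhost:9092`, and a `tweets` topic name; all three are assumptions, not details from this project's configuration.

```python
# Minimal Kafka connectivity check (assumes kafka-python and localhost:9092).
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("tweets", b"hello kafka")  # "tweets" is a hypothetical topic name
producer.flush()  # raises if the broker is unreachable
print("Kafka broker reachable")
```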
Build the Docker image using the Dockerfile provided:

```bash
docker build -t spark-jupyter:latest .
```
Run the Docker container, mounting the Spark directory so that it is accessible within the container:

```bash
docker run -p 8888:8888 -v path_to_your_project_directory/Spark:/home/jovyan/work spark-jupyter:latest
```

This mounts the `Spark` directory to `/home/jovyan/work` within the Jupyter environment.
Navigate to the Spark directory and execute the training script:

```bash
spark-submit Spark-MLlib.py
```
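For orientation, a Spark MLlib text-classification pipeline along the following lines is typical for this task. It is a sketch, not the contents of `Spark-MLlib.py`: the column layout, the `sentiment_model` output path, and the choice of logistic regression are all assumptions.

```python
# Hypothetical sketch of an MLlib training pipeline for the four sentiment
# classes. Column names and model choice are assumptions, not taken from
# the actual Spark-MLlib.py script.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF, IDF, StringIndexer
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("SentimentTraining").getOrCreate()

# Assumed file layout: no header row, four columns (id, entity, sentiment, text)
df = (spark.read.csv("twitter_training.csv", inferSchema=True)
      .toDF("tweet_id", "entity", "sentiment", "text")
      .na.drop(subset=["text"]))

pipeline = Pipeline(stages=[
    StringIndexer(inputCol="sentiment", outputCol="label"),  # 4 classes -> 0..3
    Tokenizer(inputCol="text", outputCol="words"),
    HashingTF(inputCol="words", outputCol="rawFeatures"),
    IDF(inputCol="rawFeatures", outputCol="features"),
    LogisticRegression(maxIter=20),  # multinomial by default for >2 classes
])

model = pipeline.fit(df)
model.write().overwrite().save("sentiment_model")  # reloaded by the streaming job
```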
Start the Kafka streaming process with the following command:

```bash
spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2,org.mongodb.spark:mongo-spark-connector_2.12:3.0.0 KafkaSpark-Streaming.py
```

or

```bash
spark-submit Kafka-Streaming.py
```
Ensure MongoDB is operational and the URI is set up as per your `main.py` and `Kafka-Streaming.py` configuration.
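The streaming job described in the scripts section below roughly follows the shape sketched here: read from Kafka with Structured Streaming, score with the saved model, write to MongoDB. This is an illustration under assumptions, not the actual script; the topic name (`tweets`), model path (`sentiment_model`), and MongoDB URI are placeholders.

```python
# Hypothetical sketch of the Kafka -> model -> MongoDB streaming flow.
from pyspark.sql import SparkSession
from pyspark.ml import PipelineModel

spark = (SparkSession.builder
         .appName("KafkaSentimentStream")
         .config("spark.mongodb.output.uri",
                 "mongodb://127.0.0.1/twitter.predictions")  # assumed URI
         .getOrCreate())

# Consume raw tweets from Kafka (topic name is a placeholder)
tweets = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("subscribe", "tweets")
          .load()
          .selectExpr("CAST(value AS STRING) AS text"))

# Reload the pipeline saved by the training step and score each micro-batch
model = PipelineModel.load("sentiment_model")
predictions = model.transform(tweets).select("text", "prediction")

def write_to_mongo(batch_df, batch_id):
    # mongo-spark-connector 3.x registers the "mongo" data source
    batch_df.write.format("mongo").mode("append").save()

query = predictions.writeStream.foreachBatch(write_to_mongo).start()
query.awaitTermination()
```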
The Flask app exposes the following routes (a minimal wiring sketch follows the list):
- `/stream_inserts`: Endpoint for real-time stream inserts.
- `/`: Homepage and stream testing.
- `/stream`: Stream simulation interface.
- `/validation`: Model validation and predictions display.
- `/produce_tweets`: Processes and queues tweets for analysis.
- `/stream_csv`: Streams validation data for real-time analysis.
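The sketch below shows one plausible way a few of these routes could be wired up. Route bodies are placeholders, not the project's actual logic, and the handler names are hypothetical.

```python
# Hypothetical Flask wiring for a subset of the routes listed above.
from flask import Flask, Response, render_template, request

app = Flask(__name__)

@app.route("/")
def index():
    # Homepage and stream testing
    return render_template("index.html")

@app.route("/produce_tweets", methods=["POST"])
def produce_tweets():
    # Would push the submitted text to the Kafka topic for analysis
    tweet = request.form.get("tweet", "")
    return {"queued": tweet}

@app.route("/stream_inserts")
def stream_inserts():
    # Would stream newly inserted predictions back to the browser,
    # e.g. as server-sent events read from MongoDB
    def events():
        yield "data: {}\n\n"
    return Response(events(), mimetype="text/event-stream")

if __name__ == "__main__":
    app.run(debug=True)
```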
The training dataset `twitter_training.csv` includes tweets with their sentiment labels. The `twitter_validation.csv` dataset contains tweets used for real-time sentiment prediction.
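A quick way to inspect the training data is shown below. The Kaggle file is assumed to have no header row and four columns (id, entity, sentiment, text); adjust the names if your copy differs.

```python
# Peek at the training data with pandas (column layout is an assumption).
import pandas as pd

cols = ["tweet_id", "entity", "sentiment", "text"]
train = pd.read_csv("twitter_training.csv", names=cols, header=None)
print(train["sentiment"].value_counts())  # Negative / Positive / Neutral / Irrelevant
print(train.head())
```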
The project utilizes Apache Kafka, Spark MLlib, NLTK, Pandas, Matplotlib, Flask, and Django. It is developed primarily in Python, with supplementary use of Java and JavaScript.
The data comes from the Twitter Entity Sentiment Analysis dataset on Kaggle.
There are two ways to test the streams within the application:
- Real-time text entry for immediate sentiment analysis.
- Use the "Start Streaming Tweets" button to initiate streaming from `twitter_validation.csv`.
- KafkaSpark-Streaming.py: Uses Spark Structured Streaming to consume tweets from Kafka, performs sentiment analysis with a pre-trained machine learning model, and saves the results to MongoDB.
  - Structured Streaming with Kafka: Sets up a structured streaming process to consume real-time tweets from Kafka topics, integrating Kafka as the streaming source and leveraging Spark's Structured Streaming API for real-time data processing.
  - Sentiment Analysis: Tokenizes, preprocesses, and transforms consumed tweets into feature vectors before applying the pre-trained model to predict sentiment.
  - MongoDB Integration: Stores the sentiment analysis results in MongoDB for further analysis or retrieval by downstream applications.
- KafkaProducer-Streaming.py: Collects tweets in real time using the Twitter API and produces them to a Kafka topic for further processing (a minimal producer sketch follows this list).
  - Producer-Consumer Architecture: Acts as a Kafka producer, continuously fetching tweets from the Twitter API and publishing them to a specified Kafka topic; it serves as the data source for downstream consumers such as KafkaSpark-Streaming.py.
  - Real-Time Data Collection: Uses the Twitter API to collect tweets continuously, ensuring a steady stream of data for downstream processing.
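The sketch below illustrates one way such a producer could be written, using tweepy and kafka-python. The bearer token, topic name, and filter rule are placeholders, and the actual KafkaProducer-Streaming.py script may fetch tweets differently.

```python
# Hypothetical Twitter -> Kafka producer (tweepy + kafka-python assumed).
import json

import tweepy
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

class TweetListener(tweepy.StreamingClient):
    def on_tweet(self, tweet):
        # Publish each incoming tweet to the Kafka topic for the Spark consumer
        producer.send("tweets", {"text": tweet.text})

listener = TweetListener("YOUR_BEARER_TOKEN")       # placeholder credential
listener.add_rules(tweepy.StreamRule("python"))     # example filter rule
listener.filter()                                   # blocks, streaming matching tweets
```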
The source code is available at [GitHub repository link].