This repository contains a complete deep learning pipeline for sentiment analysis with a focus on sarcasm detection in Reddit posts. It includes implementations of both classical and modern machine learning models:
- SGD Classifier (Scikit-learn)
- Bi-directional LSTM
- Custom Transformer Encoder
- Transfer Learning with DeBERTa v3 (Hugging Face)
All development was conducted in Google Colab due to local hardware limitations. Later, the codebase was modularized and tested for portability.
- Source: Kaggle - Sarcasm Detection Dataset
- Balanced Reddit dataset including:
  - Post text
  - Author scores
  - Timestamps
  - Labels (sarcastic / non-sarcastic)
  - Parent comments
Split:
- 1M total samples split into 800K train / 100K validation / 100K test
Initial correlation analyses between features and sarcasm labels revealed that only the main comment and parent comment were useful for classification.
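This screening step might look like the following sketch; the file and column names (`train-balanced-sarcasm.csv`, `label`) follow the Kaggle dataset's conventions but should be treated as assumptions rather than taken from the repo.

```python
import pandas as pd

# Load the balanced Reddit sarcasm dataset (filename per the Kaggle release).
df = pd.read_csv("train-balanced-sarcasm.csv")

# Correlate each numeric metadata column (scores, timestamps, etc.) with the
# binary sarcasm label; weak correlations motivated dropping those features.
numeric_cols = df.select_dtypes("number").columns.drop("label")
print(df[numeric_cols].corrwith(df["label"]).sort_values())
```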
| Model | Input Used | Accuracy | Notes |
|---|---|---|---|
| SGD Classifier (TF-IDF) | Main comment | 67.16% | Classical baseline using TF-IDF; fast but limited context handling |
| Bi-LSTM | Main comment | 72.91% | Custom vectorizer + simple LSTM; strong trade-off of speed vs. accuracy |
| Bi-LSTM (Dual Input) | Main + parent comment | 73.61% | Context-aware; better than using main comment alone |
| Bi-LSTM + Emojis | Main + parent + emoji-aware vocab | ~74.5% | Slight improvement; kept emojis for future scalability |
| Custom Transformer | Main + parent (with [SEP] token) | 70.00% | High training cost; didn't outperform LSTM due to data limitations |
| DeBERTa (Frozen) | Main + parent | 57.8% | All layers frozen; underfitting likely |
| DeBERTa (Half Frozen) | Main + parent (small data) | 76.0% | Strong improvement from partial unfreezing |
| DeBERTa (Half + 800K) | Main + parent (full data) | 78.0% | Best accuracy, but long training time (1+ day) |
```
|-- configs/          # Config files to set up models and training
|-- src/              # Main source folder
|   |-- dataset/      # Dataset processor and vectorizer scripts
|   |-- lr_shedule/   # Custom WarmupCosine LR scheduler script
|   |-- models/       # Model definitions
|   |-- trainer/      # Training script
|   |-- callbacks.py
|   |-- data_process.py
|   |-- metrics.py
|   |-- utils.py      # Helper functions (vectorizers, metrics, etc.)
|-- train.py          # Training entry point (takes a config to select the model and run training)
|-- README.md         # Project overview
```

Following the Scikit-learn ML Cheat Sheet, a TF-IDF vectorizer was used to encode the text and an SGDClassifier was applied; a minimal sketch of this baseline follows the results below.
Results:
- Accuracy: 67.16%
- Precision: 68.87%
- Recall: 67.16%
- F1 Score: 66.38%
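A minimal sketch of this baseline, assuming scikit-learn defaults apart from the illustrative `ngram_range` and `max_features`; `train_texts`/`train_labels` and `val_texts`/`val_labels` are placeholder variables standing in for the main-comment column and its labels.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import classification_report
from sklearn.pipeline import make_pipeline

# TF-IDF over the main comment text, followed by a linear SGD classifier.
pipeline = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), max_features=100_000),  # illustrative settings
    SGDClassifier(random_state=42),
)
pipeline.fit(train_texts, train_labels)
print(classification_report(val_labels, pipeline.predict(val_texts), digits=4))
```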
- Custom vectorizer (similar to TensorFlow's TextVectorization)
- Vocabulary built from training data
- Padding applied to match max_len (set to the 98th percentile of sequence lengths in the training data)
- Embedding layer --> Bi-directional LSTM --> Linear layer with Dropout
- Loss: CrossEntropy
- Optimizer: Adam (lr = 1e-4, weight decay = 1e-5)
Result:
Validation Accuracy: 72.91%
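A condensed sketch of this architecture in PyTorch, with illustrative embedding, hidden, and dropout sizes; the repo's actual hyperparameters may differ, but the loss and optimizer settings follow the values listed above.

```python
import torch
import torch.nn as nn

class BiLSTMClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim=128, hidden_dim=128, num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.dropout = nn.Dropout(0.3)                      # dropout rate is illustrative
        self.fc = nn.Linear(2 * hidden_dim, num_classes)    # 2x for the two directions

    def forward(self, token_ids):
        embedded = self.embedding(token_ids)                # (batch, seq_len, embed_dim)
        _, (hidden, _) = self.lstm(embedded)                # hidden: (2, batch, hidden_dim)
        final = torch.cat([hidden[-2], hidden[-1]], dim=1)  # forward + backward states
        return self.fc(self.dropout(final))

model = BiLSTMClassifier(vocab_size=50_000)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-5)
criterion = nn.CrossEntropyLoss()
```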
- Emoji Vocabulary: Added basic emojis to vocabulary --> +1% accuracy (kept for future use).
- Contextual Input: Combined main + parent comments. Two parallel Bi-LSTMs processed each input, followed by concatenation --> accuracy improved to 73.61% (see the sketch after this list)
- GloVe Embeddings: Tried but yielded no improvement, likely because Reddit's informal language differs from the corpora GloVe was trained on.
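The dual-input variant could look like the following sketch, where each comment passes through its own Bi-LSTM and the final states are concatenated before the classifier head; dimensions are again illustrative.

```python
import torch
import torch.nn as nn

class DualBiLSTMClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim=128, hidden_dim=128, num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.main_lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.parent_lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(4 * hidden_dim, num_classes)    # 2 branches x 2 directions

    def _encode(self, lstm, token_ids):
        _, (hidden, _) = lstm(self.embedding(token_ids))
        return torch.cat([hidden[-2], hidden[-1]], dim=1)   # final fwd/bwd states

    def forward(self, main_ids, parent_ids):
        combined = torch.cat([self._encode(self.main_lstm, main_ids),
                              self._encode(self.parent_lstm, parent_ids)], dim=1)
        return self.fc(combined)
```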
*Training/validation loss and accuracy for the Bi-LSTM model without parent comments*

*Training/validation loss and accuracy for the Bi-LSTM model with parent comments + emojis*
We observed that the validation loss plateaued around epoch 17 for the model trained without parent comments, and around epoch 13 when parent comments were included, while training loss continued to decrease. This behaviour indicates overfitting. Therefore, early stopping was applied to preserve generalisation.
The model could likely be improved further by tuning the embedding dimension or experimenting with a simpler LSTM architecture. These optimisations were left for future work. However, we estimate that such tuning would yield only a modest gain of 2–3% in accuracy, which may not justify the additional complexity in this context.
- Input: `[main comment] [SEP] [parent comment]`
- Text tokenized with the vectorizer, extended to include `[SEP]`
- Embedded + positional encodings --> Transformer Encoder block
Encoder Block:
- Multi-head self-attention (12 heads)
- 2 encoder layers
- LayerNorm + Dropout + 2 Dense layers (256 units, ReLU activation)
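A sketch of this encoder using PyTorch's built-in layers. Here `d_model=384` is an assumed embedding size chosen to be divisible by the 12 attention heads, and `dim_feedforward=256` corresponds to the two 256-unit dense layers inside each encoder layer; the repo's exact dimensions may differ.

```python
import torch
import torch.nn as nn

class SarcasmTransformer(nn.Module):
    def __init__(self, vocab_size, d_model=384, max_len=256, num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, d_model, padding_idx=0)
        self.pos_embedding = nn.Embedding(max_len, d_model)   # learned positional encodings
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=12, dim_feedforward=256,
            dropout=0.1, activation="relu", batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)
        self.classifier = nn.Linear(d_model, num_classes)

    def forward(self, token_ids):                             # (B, T): "[main] [SEP] [parent]"
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        x = self.embedding(token_ids) + self.pos_embedding(positions)
        x = self.encoder(x, src_key_padding_mask=token_ids.eq(0))  # mask padding tokens
        return self.classifier(x.mean(dim=1))                 # simple mean-pool over tokens
```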
Training:
- Optimizer: AdamW (weight decay = 2e-4)
- Learning rate scheduler: Warmup with cosine decay
- Gradient clipping to prevent exploding gradients
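This setup can be sketched with a `LambdaLR`-based warmup-plus-cosine schedule; the repo's `lr_shedule/` module implements its own WarmupCosine scheduler, so the step counts and `max_norm` below are placeholders, and `model` refers to the transformer defined above.

```python
import math
import torch

def warmup_cosine(warmup_steps, total_steps):
    def lr_lambda(step):
        if step < warmup_steps:                      # linear warmup phase
            return step / max(1, warmup_steps)
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * progress))  # cosine decay to 0
    return lr_lambda

optimizer = torch.optim.AdamW(model.parameters(), weight_decay=2e-4)
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=warmup_cosine(warmup_steps=1_000, total_steps=20_000))

# Inside the training loop, clip gradients before each optimizer step:
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
```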
Result:
Validation Accuracy: 70%
(Training time: ~5 hours for 20 epochs --> roughly 10x longer than the LSTM, for lower accuracy)
*Training/validation loss and accuracy for the custom Transformer model with parent comments + emojis*
The model begins to overfit after epoch 11, as indicated by the rising validation loss while training accuracy continues to improve. Therefore, early stopping with a patience of 3 was applied, resulting in a final validation accuracy of approximately 70%.
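A minimal early-stopping helper with patience 3, sketched here for illustration; the repo's `callbacks.py` may implement this differently.

```python
class EarlyStopping:
    """Stop training after `patience` epochs without validation-loss improvement."""

    def __init__(self, patience=3):
        self.patience = patience
        self.best = float("inf")
        self.counter = 0

    def step(self, val_loss):
        """Return True when training should stop."""
        if val_loss < self.best:
            self.best, self.counter = val_loss, 0
        else:
            self.counter += 1
        return self.counter >= self.patience
```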
DeBERTa v3 small was used for fine-tuning:
| Strategy | Validation Accuracy |
|---|---|
| Freeze all layers (small data, 100K) | 57.8% (4 epochs) |
| Freeze half the layers (small data, 100K) | 76.0% (4 epochs) |
| Freeze half the layers (800K train samples) | 78.0% (1 epoch) |
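The partial-freezing strategy can be sketched with Hugging Face Transformers as below; `deberta-v3-small` has a 6-layer encoder, so freezing half means freezing 3 layers plus the embeddings. The `main_comment`/`parent_comment` variables are hypothetical placeholders.

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-small")
model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/deberta-v3-small", num_labels=2)

# Freeze the embeddings and the first half of the encoder layers.
for param in model.deberta.embeddings.parameters():
    param.requires_grad = False
for layer in model.deberta.encoder.layer[:3]:        # 3 of 6 layers frozen
    for param in layer.parameters():
        param.requires_grad = False

# The tokenizer pairs main and parent comments with a separator automatically.
inputs = tokenizer(main_comment, parent_comment, truncation=True, return_tensors="pt")
```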
While DeBERTa provides the best accuracy, it is significantly more resource-intensive. Training will likely take a day or two (one epoch takes ~4 hours on a T4 GPU), and the ~4.5% improvement over the Bi-LSTM may not justify the cost for most applications.
Recommendation:
Unless maximum accuracy is critical, the Bi-LSTM offers a strong trade-off between performance and training efficiency for deployment (20 epochs train in about an hour).
- Hyperparameter optimization for Transformer model
- Experimenting with additional metadata (author score, timestamps)
- Ensemble methods combining classical and neural models
- Deployment-ready export for mobile/real-time usage



