A deep learning-based framework for multi-modal emotion recognition leveraging audio and textual features, achieving 70.49% accuracy on the MELD dataset.
This project combines MobileBERT-based textual embeddings and MFCC-based audio features with custom CNN architectures for multimodal fusion.
The code is provided as a .ipynb notebook inside the zip archive named MMERonMELD.
Understanding human emotions through multiple modalities has numerous real-world applications, including mental health monitoring, customer sentiment analysis, education technology, and human-computer interaction.
This project proposes a multi-modal deep learning pipeline that:
- Extracts audio features using Mel-Frequency Cepstral Coefficients (MFCCs).
- Extracts text embeddings using a pre-trained MobileBERT transformer.
- Passes features through parallel CNN-based sub-models.
- Performs multimodal fusion using concatenation followed by fully connected layers.
- Classifies emotions into seven distinct classes.
- Multi-modal Fusion: Combines speech and text features for enhanced performance.
- MobileBERT Integration: Lightweight transformer for efficient textual representation.
- MFCC-based Audio Processing: Captures detailed frequency-based audio features.
- Custom CNN Architecture: Parallel sub-models for each modality.
- Optimized Training:
  - Adam optimizer with a learning rate scheduler.
  - Categorical cross-entropy loss.
  - Batch normalization to stabilize training and reduce overfitting.
- High Accuracy: Achieved 70.49% validation accuracy, outperforming prior approaches.
MELD: Multimodal EmotionLines Dataset
- Derived from the TV series Friends.
- Contains 1,400+ dialogues and 13,000+ utterances.
- Includes audio, text, and visual modalities.
- Emotion categories:
  Anger, Disgust, Fear, Joy, Neutral, Sadness, Surprise
- Sentiment labels: positive, negative, neutral.
- Converted .mp4 files to .wav using PyDub.
- Extracted MFCCs using Librosa.
- Generated fixed-length vectors per utterance via zero-padding.
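Below is a minimal sketch of this audio pipeline; the sample rate, the number of MFCC coefficients (`N_MFCC`), the padded length (`MAX_FRAMES`), and the file names are illustrative assumptions rather than the notebook's exact settings.

```python
# Minimal sketch of the audio preprocessing described above (assumed settings).
import numpy as np
import librosa
from pydub import AudioSegment

N_MFCC = 40        # assumed number of MFCC coefficients
MAX_FRAMES = 300   # assumed fixed length for zero-padding

def mp4_to_wav(mp4_path: str, wav_path: str) -> None:
    """Convert an .mp4 clip to .wav with PyDub (requires ffmpeg)."""
    AudioSegment.from_file(mp4_path, format="mp4").export(wav_path, format="wav")

def extract_mfcc(wav_path: str) -> np.ndarray:
    """Load a .wav file, compute MFCCs with Librosa, and zero-pad to a fixed length."""
    signal, sr = librosa.load(wav_path, sr=16000)
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=N_MFCC)  # (N_MFCC, frames)
    if mfcc.shape[1] < MAX_FRAMES:
        mfcc = np.pad(mfcc, ((0, 0), (0, MAX_FRAMES - mfcc.shape[1])))
    else:
        mfcc = mfcc[:, :MAX_FRAMES]
    return mfcc

# Example: features = extract_mfcc("dia0_utt0.wav")  # hypothetical file name
```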
- Used MobileBERT for generating contextual embeddings.
- Tokenized sequences, padded them, and extracted last hidden state representations.
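A minimal sketch of the text branch is shown below, assuming the Hugging Face `google/mobilebert-uncased` checkpoint; the 64-token maximum length is an illustrative assumption.

```python
# Minimal sketch of MobileBERT embedding extraction (assumed checkpoint and length).
import numpy as np
import torch
from transformers import MobileBertTokenizer, MobileBertModel

tokenizer = MobileBertTokenizer.from_pretrained("google/mobilebert-uncased")
model = MobileBertModel.from_pretrained("google/mobilebert-uncased")
model.eval()

def embed_utterance(text: str, max_len: int = 64) -> np.ndarray:
    """Tokenize, pad/truncate, and return the last hidden state for one utterance."""
    inputs = tokenizer(text, padding="max_length", truncation=True,
                       max_length=max_len, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state.squeeze(0).numpy()  # (max_len, 512)

# Example: text_features = embed_utterance("Oh my God, I can't believe it!")
```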
Pipeline Summary:
- Audio Sub-network: CNN with Dense + BatchNorm layers.
- Text Sub-network: CNN with Dense + BatchNorm layers.
- Fusion Layer: Concatenation → Dense layers → Softmax classification.
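The following Keras sketch illustrates this two-branch layout; layer widths, kernel sizes, and input shapes are illustrative assumptions, not the notebook's exact configuration.

```python
# Minimal sketch of the two-branch fusion architecture (assumed layer sizes).
from tensorflow.keras import layers, models

def branch(input_shape, name):
    """1D-CNN sub-network followed by Dense + BatchNorm, one per modality."""
    inp = layers.Input(shape=input_shape, name=f"{name}_input")
    x = layers.Conv1D(64, kernel_size=3, activation="relu")(inp)
    x = layers.MaxPooling1D(2)(x)
    x = layers.Flatten()(x)
    x = layers.Dense(128, activation="relu")(x)
    x = layers.BatchNormalization()(x)
    return inp, x

audio_in, audio_feat = branch((40, 300), "audio")   # MFCC matrix (coeffs, frames)
text_in, text_feat = branch((64, 512), "text")      # MobileBERT last hidden state

fused = layers.Concatenate()([audio_feat, text_feat])
fused = layers.Dense(128, activation="relu")(fused)
outputs = layers.Dense(7, activation="softmax")(fused)  # seven emotion classes

model = models.Model(inputs=[audio_in, text_in], outputs=outputs)
```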
| Component | Configuration |
|---|---|
| Frameworks | TensorFlow, PyTorch, Librosa, PyDub |
| Text Model | MobileBERT (pre-trained) |
| Audio Model | MFCC-based CNN |
| Fusion | Concatenation + Dense Layers |
| Optimizer | Adam + LR Scheduler |
| Loss | Categorical Cross-Entropy |
| Dataset Split | 80% train / 20% test |
| Accuracy | 70.49% |
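
As a rough illustration of this training setup (continuing from the model sketch above), the snippet below compiles with Adam and categorical cross-entropy and uses `ReduceLROnPlateau` as the learning rate scheduler; the schedule parameters, batch size, epoch count, and dummy tensors are assumptions, not the notebook's exact values.

```python
# Minimal sketch of the training configuration (assumed hyperparameters).
import numpy as np
import tensorflow as tf

# Dummy tensors standing in for the real preprocessed MELD features.
X_audio = np.random.rand(100, 40, 300).astype("float32")
X_text = np.random.rand(100, 64, 512).astype("float32")
y = tf.keras.utils.to_categorical(np.random.randint(0, 7, size=100), num_classes=7)

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)

# Assumed learning rate schedule: halve the LR when validation loss plateaus.
lr_schedule = tf.keras.callbacks.ReduceLROnPlateau(
    monitor="val_loss", factor=0.5, patience=2, min_lr=1e-6
)

history = model.fit(
    [X_audio, X_text], y,
    validation_split=0.2,   # mirrors the 80% train / 20% test split
    epochs=30, batch_size=32, callbacks=[lr_schedule],
)
```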
| Method | Accuracy (%) |
|---|---|
| DialogueCRN (Hu et al., 2021) | 60.73 |
| UniMSE (Hu et al., 2022) | 65.00 |
| M2FNet (Chudasama et al., 2022) | 67.85 |
| Our Proposed Method | 70.49 |
