Multimodal Emotion Recognition Using Explainable Hybrid CNN-Transformer Networks
Narote Preetham
Telangana Tribal Welfare Residential Degree College (Boys), Boath, Adilabad, Telangana, India.
Alurwad Tripat Venkatreddy
Government Degree College, Nirmal, Telangana, India.
K. Krunal Yadav
Government Degree College (Arts and Commerce), Adilabad, Telangana, India.
Ankatwar Gajanan*
Government Degree College (Arts and Commerce), Adilabad, Telangana, India.
Gajjala Lilly Rani
Avanthi’s Scientific Technological & Research Academy, Hyderabad, Telangana, India.
*Author to whom correspondence should be addressed.
Abstract
Emotion recognition is central to affective computing because it enables intelligent systems to perceive and respond to human affective states in domains such as healthcare, education, customer service, and human–computer interaction. However, unimodal methods based solely on facial expressions, speech, or text often fail in real-world settings because of noise, ambiguity, occlusions, and limited contextual cues, while many multimodal systems rely on simplistic fusion and remain difficult to interpret. This study proposes an Explainable Hybrid CNN-Transformer Network (EHCTN) for multimodal emotion recognition that integrates visual, audio, and textual information while improving transparency. Facial frames and audio spectrograms/MFCC-based representations are encoded by CNNs to capture discriminative spatial and acoustic patterns, and text is embedded using a pre-trained BERT model to obtain contextual semantics. The modality-specific features are combined by an attention-based fusion mechanism that dynamically weights each modality, which strengthens robustness to noisy or partially missing inputs. Transformer layers then model long-range dependencies and cross-modal interactions, and a softmax classifier predicts the emotion category (e.g., happiness, sadness, anger, fear, surprise, neutral). Explainability is incorporated using Grad-CAM to localise salient facial regions and SHAP to quantify influential features across modalities. Experiments were conducted on IEMOCAP, MELD, and CMU-MOSEI with a 70/15/15 train–validation–test split and data augmentation. Training used PyTorch on NVIDIA GPUs with AdamW (learning rate 1e-4, batch size 32, 100 epochs, dropout 0.5). EHCTN outperforms CNN, Transformer, and CNN-LSTM baselines, achieving 87.9% accuracy, 87.3% precision, 86.9% recall, and 87.1% F1-score, with reported accuracy gains of 11.4%, 6.2%, and 4.7% over the respective baselines. Confusion-matrix analysis indicates the strongest performance for the Neutral class (228 correct predictions) and minor confusion between Sad and Angry. Grad-CAM and SHAP analyses confirm reliance on meaningful facial regions (eyes, eyebrows, mouth), speech cues (e.g., pitch variation, intensity), and salient words, supporting trustworthy deployment of robust, interpretable emotion-aware systems.
Keywords: Multimodal emotion recognition, CNN, Transformer, explainable AI, attention mechanism, deep learning, affective computing.
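The attention-based fusion step described in the abstract can be illustrated with a minimal, dependency-free sketch. It is not the EHCTN implementation: the function names (`softmax`, `attention_fuse`), the toy 4-dimensional embeddings, and the fixed `query` vector are all assumptions made for illustration. In a trained model the scoring parameters would be learned, and the feature vectors would come from the CNN and BERT encoders. The sketch only shows how softmax weights over the available modalities yield a weighted sum, and how the weights renormalise when one modality is missing.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_fuse(features, query):
    """Fuse modality feature vectors using softmax attention weights.

    features: dict mapping modality name -> feature vector (list of floats),
              all with the same length as `query`; a missing modality is
              simply absent from the dict.
    query:    hypothetical scoring vector (would be learned in a real model).
    Returns (fused_vector, weights_by_modality).
    """
    names = sorted(features)
    dim = len(query)
    # Scaled dot-product score for each available modality.
    scores = [
        sum(f * q for f, q in zip(features[n], query)) / math.sqrt(dim)
        for n in names
    ]
    weights = softmax(scores)
    # Weighted sum of the modality vectors, one output value per dimension.
    fused = [
        sum(w * features[n][i] for w, n in zip(weights, names))
        for i in range(dim)
    ]
    return fused, dict(zip(names, weights))

# Toy embeddings standing in for CNN (visual, audio) and BERT (text) outputs.
feats = {
    "visual": [0.9, 0.1, 0.0, 0.2],
    "audio": [0.2, 0.8, 0.1, 0.0],
    "text": [0.1, 0.2, 0.9, 0.3],
}
query = [1.0, 0.5, 0.2, 0.0]  # assumed values, for illustration only

fused, weights = attention_fuse(feats, query)

# Dropping a modality renormalises the weights over the remaining ones.
fused_no_audio, weights_no_audio = attention_fuse(
    {k: v for k, v in feats.items() if k != "audio"}, query
)
```

With these toy values the visual vector aligns best with the query and so receives the largest weight. When the audio entry is removed, the remaining two weights still sum to one, which is the sense in which this kind of fusion can tolerate a missing input.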