A Review on Transformer-Based Deep Learning Models for Multimodal Emotion Recognition


Authors: Research Scholar Udaya Kumar Nanubala, Professor Dr. Pankaj Khairnar

Abstract: Emotion recognition has become an important research field in artificial intelligence because it can improve human-computer interaction and support the development of more intelligent systems. Traditional methods that rely on a single type of data fail to capture human emotions adequately, because emotions are expressed simultaneously through multiple channels, including text, speech, and facial expressions. This paper reviews transformer-based deep learning models that combine different types of data for emotion recognition. The study traces how emotion recognition methods have evolved from rule-based and classical machine learning approaches to modern deep learning and transformer systems, showing clear progress from basic techniques to advanced ones. Deep learning models such as CNNs and RNNs have improved feature extraction and pattern recognition, but they struggle with long-range dependencies and with combining different data types. Transformer models use attention mechanisms to capture context more effectively and to process heterogeneous modalities within a unified framework. Recent studies show that multimodal transformer systems improve emotion detection by integrating multiple data sources, yielding more accurate and reliable results. The review analyzes multimodal fusion techniques, including early, late, and hybrid fusion strategies, and their role in improving system performance. Despite this progress, challenges such as data heterogeneity, cross-modal alignment, high computational cost, and the scarcity of large multimodal datasets remain critical issues that require further attention.
This study also identifies important research gaps and emphasizes that efficient fusion mechanisms, scalable architectures, and real-world deployment strategies need further development. The findings offer useful insights for building emotion recognition systems that can further improve human-machine interaction.
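The early and late fusion strategies mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation; the feature dimensions, the three modalities, and the four-emotion label space below are all illustrative assumptions. Early fusion concatenates modality features before classification, while late fusion combines per-modality prediction scores afterwards.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-modality feature vectors (dimensions chosen for illustration).
text_feat = rng.normal(size=64)    # e.g. from a text encoder
audio_feat = rng.normal(size=32)   # e.g. from a speech encoder
visual_feat = rng.normal(size=48)  # e.g. from a facial-expression encoder

def early_fusion(*features):
    """Early fusion: concatenate raw modality features into a single
    vector before any classifier sees them."""
    return np.concatenate(features)

def late_fusion(*score_vectors):
    """Late fusion: average per-modality class scores produced by
    separate classifiers, one per modality."""
    return np.mean(score_vectors, axis=0)

fused = early_fusion(text_feat, audio_feat, visual_feat)
assert fused.shape == (64 + 32 + 48,)

# Three hypothetical per-modality probability distributions over 4 emotions.
scores = [rng.dirichlet(np.ones(4)) for _ in range(3)]
combined = late_fusion(*scores)
assert np.isclose(combined.sum(), 1.0)
```

Hybrid fusion, also discussed in the review, mixes the two: some modalities are fused early into joint features while the resulting scores are still combined late with other modality-specific predictions.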

DOI: https://zenodo.org/records/20074005
