GBV-Net: Hierarchical Fusion of Facial Expressions and Physiological Signals for Multimodal Emotion Recognition (2025)

Jiling Yu, Yandong Ru, Bangjun Lei, and Hongming Chen
School of Information Engineering, Zhejiang Ocean University, Zhoushan 316022, China.

A core challenge in multimodal emotion recognition lies in the precise capture of the inherent multimodal interactive nature of human emotions. Addressing the limitation of existing methods, which often process visual signals (facial expressions) and physiological signals (EEG, ECG, EOG, and GSR) in isolation and thus fail to exploit their complementary strengths effectively, this paper presents a new multimodal emotion recognition framework called the Gated Biological Visual Network (GBV-Net). This framework enhances emotion recognition accuracy through deep synergistic fusion of facial expressions and physiological signals. GBV-Net integrates three core modules: (1) a facial feature extractor based on a modified ConvNeXt V2 architecture incorporating lightweight Transformers, specifically designed to capture subtle spatio-temporal dynamics in facial expressions; (2) a hybrid physiological feature extractor combining 1D convolutions, Temporal Convolutional Networks (TCNs), and convolutional self-attention mechanisms, adept at modeling local patterns and long-range temporal dependencies in physiological signals; and (3) an enhanced gated attention fusion module capable of adaptively learning inter-modal weights to achieve dynamic, synergistic integration at the feature level. A thorough investigation of the publicly accessible DEAP and MAHNOB-HCI datasets reveals that GBV-Net surpasses contemporary methods. Specifically, on the DEAP dataset, the model attained classification accuracies of 95.10% for Valence and 95.65% for Arousal, with F1-scores of 95.52% and 96.35%, respectively. On MAHNOB-HCI, the accuracies achieved were 97.28% for Valence and 97.73% for Arousal, with F1-scores of 97.50% and 97.74%, respectively. These experimental findings substantiate that GBV-Net effectively captures deep-level interactive information between multimodal signals, thereby improving emotion recognition accuracy.
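The gated attention fusion described in module (3) can be illustrated with a minimal sketch: a gate network produces per-dimension weights that adaptively blend the facial and physiological feature streams. This is an assumption-level illustration of the general gating idea, not the paper's exact design; the class name, dimensions, and single-gate formulation are hypothetical.

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Illustrative gated fusion of two modality feature vectors.

    A sketch of the general technique, assuming both streams are
    projected to a common dimension; not the authors' exact module.
    """
    def __init__(self, dim):
        super().__init__()
        # The gate maps the concatenated features to weights in (0, 1)
        # for the facial stream; the physiological stream receives the
        # complementary weight, so the output is a convex combination.
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, face_feat, physio_feat):
        g = self.gate(torch.cat([face_feat, physio_feat], dim=-1))
        return g * face_feat + (1 - g) * physio_feat

fusion = GatedFusion(dim=128)
face = torch.randn(4, 128)    # batch of facial-expression features
physio = torch.randn(4, 128)  # batch of physiological-signal features
fused = fusion(face, physio)
print(fused.shape)  # torch.Size([4, 128])
```

Because the gate is learned from both inputs jointly, the blend can shift toward whichever modality is more informative for a given sample, which is the "adaptive inter-modal weighting" behavior the abstract describes.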
