Enhancing Speech Synthesis With Human-Like Emotional Intelligence For Natural And Expressive Communication


Authors: Paul Binu, Paulu Wilson, Ronal Shoey George

Abstract: This paper presents an emotion-aware, voice-based conversational therapy assistant that integrates speech recognition, conversational AI, and emotional text-to-speech synthesis into a unified pipeline. The system captures user speech through a microphone, transcribes it to text, generates context-aware empathetic responses using a large language model (Gemini AI), and synthesizes emotionally expressive speech output using IndexTTS2 with zero-shot voice cloning. The architecture follows a modular design comprising four major modules: Voice Input, Processing and AI, Emotion Analysis, and Speech Synthesis. The emotion mapping subsystem identifies user affect and selects an appropriate response emotion to guide TTS output. Evaluation against two baselines (generic neutral TTS and a rule-based keyword approach) shows that the proposed model achieves the highest overall score of 74.51, significantly outperforming both baselines in holistic end-to-end quality. The system balances emotion recognition accuracy, response relevance, and audio naturalness, making it suitable for mental health support, virtual assistants, and human-centered AI applications. The results confirm that combining emotional conditioning with contextual response generation yields substantially better conversational quality than neutral or rule-driven approaches.
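The four-module pipeline and the emotion mapping subsystem described in the abstract can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: every function here is a hypothetical stub (the real system would call a speech recognizer, the Gemini API, and IndexTTS2 at the marked points), and the keyword detector and emotion map are placeholder examples of the affect-to-response mapping idea.

```python
# Hypothetical sketch of the four-module architecture: Voice Input,
# Emotion Analysis, Processing and AI, and Speech Synthesis.
# All names and mappings below are illustrative, not from the paper.

# Emotion mapping subsystem: detected user affect -> response emotion
# used to condition the TTS output.
EMOTION_MAP = {
    "sad": "comforting",
    "angry": "calm",
    "anxious": "reassuring",
    "happy": "cheerful",
    "neutral": "neutral",
}

def transcribe(audio: bytes) -> str:
    """Voice Input module: speech-to-text (stubbed with a fixed utterance)."""
    return "I feel really anxious about my exams."

def detect_emotion(text: str) -> str:
    """Emotion Analysis module: naive keyword matcher standing in for a real classifier."""
    for word in ("anxious", "sad", "angry", "happy"):
        if word in text.lower():
            return word
    return "neutral"

def generate_response(text: str, user_emotion: str) -> str:
    """Processing and AI module: would call an LLM such as Gemini; stubbed here."""
    return f"It sounds like you're feeling {user_emotion}. Let's talk it through."

def synthesize(text: str, emotion: str) -> bytes:
    """Speech Synthesis module: would call an emotional TTS engine such as
    IndexTTS2 with a cloned voice; here we just tag the text with the emotion."""
    return f"[{emotion}] {text}".encode()

def pipeline(audio: bytes) -> bytes:
    """End-to-end flow: speech in, emotion-conditioned speech out."""
    user_text = transcribe(audio)
    user_emotion = detect_emotion(user_text)
    response_emotion = EMOTION_MAP.get(user_emotion, "neutral")
    reply = generate_response(user_text, user_emotion)
    return synthesize(reply, response_emotion)
```

With the stubbed transcript above, the mapping selects "reassuring" as the response emotion for an "anxious" user, illustrating how the detected affect steers the synthesis step rather than being echoed back directly.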

DOI: https://doi.org/10.5281/zenodo.20045982
