Authors: Gitesh Patil, Sakshi Mahajan, Shloka Shetty, Samruddhi Nevse, Dr. Arati R. Deshpande
Abstract: Conversational AI companions operating in emotionally sensitive and therapeutic contexts require the joint integration of speech understanding, affective reasoning, and photorealistic visual feedback. Existing systems address these capabilities only in isolation, and largely through cloud-dependent infrastructure that introduces recurring costs and privacy concerns. Although recent advances in large language models and neural speech synthesis have improved the quality of automated dialogue, current systems lack the structural coupling between emotion recognition, semantic memory, and avatar-driven facial expressiveness necessary for naturalistic human-computer interaction. This paper presents EmotionSync, a locally hosted conversational AI companion capable of performing end-to-end affective interaction while maintaining real-time responsiveness. The proposed system integrates faster-Whisper-based speech-to-text transcription, Wav2Vec2 speech emotion recognition, retrieval-augmented generation over a ChromaDB vector store, locally served LLaMA 3.1 language model inference, Microsoft Edge neural text-to-speech synthesis, and NVIDIA Audio2Face 3D blendshape-driven avatar animation within a unified WebSocket streaming pipeline. By enforcing phrase-boundary audio chunking and performance.now()-anchored blendshape dispatch, the framework ensures frame-accurate lip synchronization and emotionally coherent response generation. The proposed framework contributes toward practical, privacy-preserving affective AI companions suitable for therapeutic, educational, and social interaction applications.
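The two synchronization ideas named in the abstract can be illustrated with a minimal sketch. It is not the EmotionSync implementation: it assumes that phrase-boundary chunking means cutting the streamed language-model output at punctuation before each chunk is synthesized, and that anchored dispatch means selecting the blendshape frame from the time elapsed since a clock reading taken when the audio chunk starts (in a browser that reading would come from performance.now(); here it is passed in as a plain number). The names `phrase_chunks`, `frame_index_at`, and `min_chars` are hypothetical.

```python
import re
from bisect import bisect_right
from typing import Iterable, Iterator, Sequence

# Assumed boundary set: punctuation followed by whitespace ends a phrase.
_PHRASE_END = re.compile(r"[.!?;:,]\s+")


def phrase_chunks(tokens: Iterable[str], min_chars: int = 12) -> Iterator[str]:
    """Yield text chunks that end on a phrase boundary as tokens stream in.

    A boundary is used only if the chunk would be at least `min_chars`
    long, so very short fragments are merged into the following phrase.
    """
    buffer = ""
    for token in tokens:
        buffer += token
        while True:
            # First boundary that leaves a chunk of sufficient length.
            cut = next(
                (m.end() for m in _PHRASE_END.finditer(buffer)
                 if m.end() >= min_chars),
                None,
            )
            if cut is None:
                break
            yield buffer[:cut].strip()
            buffer = buffer[cut:]
    # Flush whatever remains once the token stream ends.
    if buffer.strip():
        yield buffer.strip()


def frame_index_at(frame_times_ms: Sequence[float],
                   anchor_ms: float,
                   now_ms: float) -> int:
    """Return the index of the latest frame due at `now_ms`.

    `frame_times_ms` holds sorted frame timestamps relative to the start
    of the audio chunk; `anchor_ms` is the clock reading captured when
    playback of that chunk began. Returns -1 before the first frame.
    """
    elapsed = now_ms - anchor_ms
    return bisect_right(frame_times_ms, elapsed) - 1


if __name__ == "__main__":
    stream = ["I hear", " you. That", " sounds hard, ", "but you are", " not alone."]
    for chunk in phrase_chunks(stream):
        print(repr(chunk))

    times = [0.0, 33.3, 66.7, 100.0]  # roughly 30 frames per second
    print(frame_index_at(times, anchor_ms=1000.0, now_ms=1040.0))
```

Selecting the frame from elapsed time, rather than advancing a counter on every render tick, means a delayed tick skips ahead to the correct frame, so timing error does not accumulate over a chunk.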