Authors: Sangani Harshil, Kalariya Meet, Baraiya Ravi, Vasani Bhumil, Dr. Vikram B. Kaushik
Abstract: Automatically describing visual content in natural language is an active research problem in artificial intelligence. We address it with a neural architecture that translates an image into a coherent textual description. Our method has two stages. First, we use InceptionV3, a well-established convolutional neural network, to extract features from the visual elements present in uploaded images. These visual representations then feed a language generation pipeline built around a Gated Recurrent Unit (GRU) architecture. Our implementation incorporates a spatial attention module that lets the model focus selectively on different image regions during caption generation. This attention-driven approach mirrors human visual processing, in which we naturally emphasize certain areas while describing a scene. To demonstrate the practical utility of our research, we built a web-based platform using the Streamlit framework. This interactive system lets users upload photographs and receive generated captions immediately, with audio narration provided through integrated speech synthesis.
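The spatial attention step summarized in the abstract can be illustrated with a small sketch. It assumes additive (Bahdanau-style) scoring, which the abstract does not specify, and it uses toy dimensions and random weights in place of real InceptionV3 region features and a trained GRU hidden state. All function and variable names here are hypothetical and chosen only for illustration.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def matvec(matrix, vector):
    return [dot(row, vector) for row in matrix]


def spatial_attention(region_features, hidden_state, W_feat, W_hid, v):
    """Assumed additive scoring: score_i = v . tanh(W_feat f_i + W_hid h).

    Returns the context vector (attention-weighted sum of region features)
    and the attention weights over regions.
    """
    projected_hidden = matvec(W_hid, hidden_state)
    scores = []
    for feature in region_features:
        projected_feature = matvec(W_feat, feature)
        combined = [math.tanh(a + b)
                    for a, b in zip(projected_feature, projected_hidden)]
        scores.append(dot(v, combined))
    weights = softmax(scores)
    feature_dim = len(region_features[0])
    context = [sum(w * f[d] for w, f in zip(weights, region_features))
               for d in range(feature_dim)]
    return context, weights


def random_matrix(rng, rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


# Toy sizes for readability; these are not the dimensions used in the paper.
NUM_REGIONS, FEATURE_DIM, HIDDEN_DIM, ATTN_DIM = 6, 4, 3, 5

rng = random.Random(0)
region_features = random_matrix(rng, NUM_REGIONS, FEATURE_DIM)  # stand-in for CNN features
hidden_state = [rng.uniform(-0.5, 0.5) for _ in range(HIDDEN_DIM)]  # stand-in for GRU state
W_feat = random_matrix(rng, ATTN_DIM, FEATURE_DIM)
W_hid = random_matrix(rng, ATTN_DIM, HIDDEN_DIM)
v = [rng.uniform(-0.5, 0.5) for _ in range(ATTN_DIM)]

context, weights = spatial_attention(region_features, hidden_state, W_feat, W_hid, v)
```

In a captioning loop of this kind, the context vector would typically be combined with the embedding of the previous word and passed to the GRU at each step, so the weights are recomputed for every generated word.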