Authors: Professor Pradnya Patange, Atharv Pate, Harsh Lonari, Mayuresh Kshirsagar, Manish Patil
Abstract: The rapid advancement of deepfake technology has introduced significant challenges to digital media authenticity, enabling the creation of highly convincing synthetic images and videos that are difficult to distinguish from genuine content. This study proposes an advanced deepfake detection framework based on the Temporal Vision-Language Transformer (TVLT), a multimodal deep learning architecture that jointly learns visual, temporal, and semantic representations. Unlike traditional convolutional or recurrent models, which focus solely on the spatial or temporal domain, the proposed TVLT-based system integrates cross-modal attention to capture complex correlations among video frames, motion patterns, and audio-text alignment cues. The model efficiently identifies inconsistencies in facial movement, speech synchronization, lighting, and micro-expressions, features that deepfake generation methods struggle to replicate authentically.
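To make the cross-modal fusion idea concrete, the sketch below shows one common way such an attention block can be implemented in PyTorch, with video-frame tokens attending to audio tokens so the model can pick up alignment cues such as lip-speech synchronization. The module name, dimensions, and layer choices are illustrative assumptions for exposition, not the paper's actual implementation.

```python
# Minimal sketch of a cross-modal attention block (illustrative only;
# all names and hyperparameters here are assumptions, not the authors' code).
import torch
import torch.nn as nn


class CrossModalAttentionBlock(nn.Module):
    """One modality's tokens (e.g., video frames) attend to another's
    (e.g., audio segments) to learn cross-modal alignment cues."""

    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        self.norm_out = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, query_tokens: torch.Tensor,
                context_tokens: torch.Tensor) -> torch.Tensor:
        # Queries come from one modality; keys/values from the other.
        q = self.norm_q(query_tokens)
        kv = self.norm_kv(context_tokens)
        fused, _ = self.attn(q, kv, kv)
        x = query_tokens + fused                # residual connection
        return x + self.ffn(self.norm_out(x))   # position-wise feed-forward


if __name__ == "__main__":
    video_tokens = torch.randn(2, 16, 256)  # (batch, frames, embed dim)
    audio_tokens = torch.randn(2, 40, 256)  # (batch, audio segments, embed dim)
    block = CrossModalAttentionBlock()
    print(block(video_tokens, audio_tokens).shape)  # torch.Size([2, 16, 256])
```

In this sketch the residual connections preserve each modality's own representation while the attention output injects information from the other stream, the same general pattern used by multimodal transformers to correlate frames, motion, and audio.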