Authors: Bhaskara B, Vinith M, Kumarswamy S
Abstract: Understanding food from images poses a major challenge in the area of recipe search, with impactful applications in smart kitchens, dietary monitoring, and automated cooking assistance. Traditional approaches typically handle ingredient recognition and instruction generation as separate tasks, often resulting in incoherent or disjointed outputs. Here, we present Autochef AI, a multi-modal attention tool that seamlessly joins visual and textual information to accurately identify ingredients and generate step-by-step cooking instructions from food images. By incorporating attention mechanisms across both image and text modalities, our model captures fine-grained features essential for coherent and contextually grounded recipe generation. Experimental results demonstrate that our approach significantly improves both ingredient prediction accuracy and instruction quality across a wide variety of recipes and cuisines.
DOI: https://doi.org/10.5281/zenodo.16522342
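The abstract's core mechanism, attention across image and text modalities, can be illustrated with a minimal sketch. This is not the paper's actual architecture; it is a generic scaled dot-product cross-attention in NumPy, where hypothetical image-patch features (queries) attend over hypothetical ingredient-token embeddings (keys/values) to produce fused features. All array shapes and names here are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(queries, keys, values):
    """Scaled dot-product attention: one modality attends to the other.

    queries: (n_q, d) features from modality A (e.g. image patches)
    keys, values: (n_k, d) features from modality B (e.g. text tokens)
    returns: (n_q, d) attended (fused) features
    """
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)   # (n_q, n_k) affinity matrix
    weights = softmax(scores, axis=-1)       # each query's distribution over keys
    return weights @ values                  # weighted sum of value vectors

# Toy data (shapes are illustrative, not from the paper):
rng = np.random.default_rng(0)
img_feats = rng.normal(size=(4, 8))   # 4 image-patch features, dim 8
txt_feats = rng.normal(size=(6, 8))   # 6 ingredient-token embeddings, dim 8

fused = cross_modal_attention(img_feats, txt_feats, txt_feats)
print(fused.shape)  # (4, 8): one text-informed vector per image patch
```

In a full model, such fused features would feed a decoder that emits the ingredient list and step-by-step instructions; here the sketch only shows how the two modalities are combined.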