Autonomous Failure Diagnosis & Self- Healing In Large-Scale Cloud Systems Using Self-Reflective Agentic AI

Uncategorized

Authors: Dharunika G, Jack Robin J

Abstract: The increasing scale and dynamic complexity of modern cloud computing environments pose significant challenges for ensuring system reliability and availability, making traditional manual fault diagnosis and recovery insufficient. This paper explores advanced methodologies, contrasting established statistical monitoring techniques with emerging Artificial Intelligence (AI)-driven autonomous self-healing frameworks designed to manage faults, minimize downtime, and optimize resource utilization. Early methods employed correlation analysis using Canonical Correlation Analysis (CCA) and Exponentially Weighted Moving Average (EWMA) control charts for anomaly detection, followed by feature selection using ReliefF and SVM-RFE for problem location. The evolution toward AI-driven solutions leverages machine learning (ML), deep learning (DL), and reinforcement learning (RL) to achieve automated failure prediction and recovery. Recent advancements feature hybrid AI models and self- reflective multi-agent systems (SR-MACHA) that demonstrate enhanced accuracy and self- optimization capabilities, achieving significant reductions in Mean Time to Repair (MTTR) and improving system resilience. However, challenges remain regarding the scalability, computational cost of training complex models, and ensuring real-time performance in dynamic, heterogeneous cloud infrastructures.

× How can I help you?