Authors: Dr. Daniel Thompson, Dr. Olivia Bennett, James Walker, Dr. Hannah Collins, Andrew Richard
Abstract: Modern distributed systems operate at a scale and complexity that far exceed the limits of manual management and static fault-handling mechanisms: they span geographically dispersed resources, heterogeneous hardware and software stacks, and dynamically changing workloads. In such environments, failures are not exceptional events but an inherent part of normal operation, arising from partial outages, transient faults, software defects, and unpredictable interactions among system components. Autonomic computing emerged in the early 2000s as a response to these challenges, proposing self-managing systems capable of self-configuration, self-optimization, self-protection, and self-healing through continuous feedback and adaptation. Over the past two decades, advances in artificial intelligence and machine learning have substantially strengthened autonomic control loops, transforming them from rule-driven mechanisms into adaptive, data-driven decision systems that can learn from experience, generalize across failure scenarios, and operate effectively under uncertainty. This article presents a comprehensive overview of AI-driven autonomic control for self-healing distributed systems, synthesizing foundational autonomic computing architectures, closed-loop control models, and learning-based decision mechanisms. Drawing on established architectural diagrams from the pre-2021 literature, we analyze how reinforcement learning, probabilistic reasoning, and hybrid AI techniques enhance fault detection, root-cause analysis, and recovery planning. We conclude by highlighting key empirical studies and the open research challenges that continue to motivate work on intelligent, self-healing distributed systems.