Design Of Reinforcement Learning Grid World Navigation System Using Rewards And Penalties: Q-Learning, SARSA And Double Q-Learning


Authors: Prachi Durge, Mahek Shribas, Mohanish Lanjewar, Parth Gadwal, Pranay Wadibhasme, Pranjali Nakhate

Abstract: This paper presents a systematic comparative study of three tabular reinforcement learning (RL) algorithms—Q-Learning, State-Action-Reward-State-Action (SARSA), and Double Q-Learning—deployed within a configurable stochastic GridWorld environment. The environment incorporates slip-based stochastic transitions, trap cells, potential-based reward shaping grounded in the theoretical guarantees of Ng et al. [1], and partial observability modes. The central research hypothesis investigates whether Double Q-Learning’s decoupled selection-evaluation mechanism demonstrably reduces maximization bias compared to vanilla Q-Learning, particularly under elevated stochastic transition probabilities. An interactive web-based research platform is developed using Flask and Chart.js, enabling real-time policy visualization, value-function heatmaps, Q-table analysis, and multi-seed benchmark comparisons with confidence intervals. Experimental results across three canonical grid configurations demonstrate that Double Q-Learning achieves superior convergence stability and reduced overestimation in high-slip environments, while SARSA exhibits inherently conservative on-policy behavior that trades off peak performance for robustness near traps.
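The decoupled selection-evaluation mechanism the abstract refers to can be illustrated with a minimal sketch of a tabular Double Q-Learning update. This is not the paper's implementation; the table layout (nested dicts keyed by state, then action) and parameter names are illustrative assumptions. One table selects the greedy next action while the other evaluates it, which is what mitigates the maximization bias of vanilla Q-Learning:

```python
import random

def double_q_update(QA, QB, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One Double Q-Learning step (hypothetical sketch).

    QA, QB: dicts mapping state -> {action: value}.
    With probability 0.5 we update QA using QB as the evaluator,
    otherwise the roles are swapped; selection and evaluation are
    thus decoupled across the two tables.
    """
    if random.random() < 0.5:
        # Select the greedy next action with QA, but evaluate it with QB.
        a_star = max(QA[s_next], key=QA[s_next].get)
        target = r + gamma * QB[s_next][a_star]
        QA[s][a] += alpha * (target - QA[s][a])
    else:
        # Symmetric update: select with QB, evaluate with QA.
        b_star = max(QB[s_next], key=QB[s_next].get)
        target = r + gamma * QA[s_next][b_star]
        QB[s][a] += alpha * (target - QB[s][a])
```

Because the action that maximizes one table is evaluated by an independently learned table, a single noisy overestimate is unlikely to be both selected and overvalued, which is consistent with the reduced overestimation the abstract reports in high-slip environments.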

DOI: https://doi.org/10.5281/zenodo.19830361
