Data Poison Detection Schemes For Distributed Machine Learning

Authors: Satyaki Adak

Abstract: Distributed Machine Learning (DML) enables efficient training over massive datasets by distributing computation across multiple nodes. However, it also increases vulnerability to data poisoning attacks, in which adversaries inject malicious or mislabeled data to corrupt the learning process, so ensuring model integrity in such environments is a critical security challenge. This project classifies DML systems into basic-DML and semi-DML according to whether the central server itself takes part in training on the dataset. For the basic-DML scenario, a novel cross-learning-based data poisoning detection scheme is proposed, in which training results from distributed workers are compared across multiple training loops to identify anomalous behaviour. A mathematical model is developed to determine the optimal number of training loops, balancing detection accuracy against overhead. For the semi-DML scenario, an enhanced poison detection mechanism is introduced that leverages the central server’s computing resources, together with an optimal resource allocation strategy that reduces unnecessary computation. Experimental results demonstrate that the proposed schemes significantly improve model accuracy in basic-DML, by up to 20% for Support Vector Machines and 60% for Logistic Regression, while reducing wasted resources by 20–100% in semi-DML. The proposed framework offers a general, efficient, and scalable defence against data poisoning attacks in distributed learning environments.
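
As a rough illustration of the cross-learning idea summarized in the abstract, the sketch below compares the results that several workers report for the same sub-dataset and flags any worker whose result departs from the others. It is not the author's implementation: the function name `flag_suspect_workers`, the `results` layout, the per-coordinate median as the reference, and the `tolerance` threshold are all assumptions made for this example, and the abstract does not specify how results are compared or how the number of training loops is chosen.

```python
from statistics import median


def flag_suspect_workers(results, tolerance=1e-6):
    """Flag workers whose reported result disagrees with their peers.

    results: {sub_dataset_id: {worker_id: [param, ...]}} (hypothetical layout),
             where each sub-dataset has been trained by several workers over
             the training loops.
    tolerance: assumed maximum allowed per-coordinate deviation.
    """
    suspects = set()
    for by_worker in results.values():
        vectors = list(by_worker.values())
        dim = len(vectors[0])
        # Assumed reference: the per-coordinate median across workers,
        # which a minority of poisoned results cannot shift.
        reference = [median(v[i] for v in vectors) for i in range(dim)]
        for worker_id, vec in by_worker.items():
            deviation = max(abs(a - b) for a, b in zip(vec, reference))
            if deviation > tolerance:
                suspects.add(worker_id)
    return suspects


# Toy data: worker "w4" reports a result for block_1 that its peers do not.
results = {
    "block_0": {"w1": [0.5, -1.0], "w2": [0.5, -1.0], "w3": [0.5, -1.0]},
    "block_1": {"w2": [0.2, 0.3], "w3": [0.2, 0.3], "w4": [2.0, -4.0]},
}
suspects = flag_suspect_workers(results)
print(sorted(suspects))
```

Using a median rather than a mean is a design choice for this sketch only: it keeps a single outlier from dragging the reference toward itself, at the cost of assuming that honest workers form a majority for each sub-dataset.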

DOI: http://doi.org/10.5281/zenodo.18051593
