
Quantization Aware Training Techniques for Efficient Transformer-Driven Large Language Models

Authors: Sai Sukesh Reddy Tummuri

Abstract: Transformer-based large language models have scaled rapidly, delivering unprecedented performance gains at the cost of high computational complexity, memory usage, and energy consumption. These limitations hamper their deployment in real-time and resource-constrained environments. To improve inference efficiency while maintaining predictive accuracy, this paper proposes a Dynamic Sensitivity-Aware Quantization-Aware Training (DSA-QAT) framework. In contrast to traditional quantization approaches that apply a uniform precision reduction, the proposed method adaptively adjusts quantization precision based on layer-wise sensitivity and training dynamics, enabling more informed precision allocation across transformer components. The framework was evaluated in controlled simulation experiments using representative performance and efficiency metrics. Experimental results showed that the quantized model achieved prediction accuracy above 97% while maintaining balanced precision, recall, and F1-score values. The model also exhibited strong robustness to quantization noise, reduced inference latency, a smaller memory footprint, improved energy efficiency, and stable training-loss convergence. A notable reduction in model size was also observed, enabling efficient deployment without sacrificing performance. Overall, the findings demonstrate that the proposed DSA-QAT framework effectively narrows the trade-off between accuracy and model efficiency, highlighting the potential of adaptive quantization-aware strategies for the scalable, sustainable, high-performance deployment of large language models in practical applications.
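The core mechanism the abstract describes, quantization-aware training via fake quantization with a straight-through estimator combined with per-layer bit-width allocation driven by a sensitivity signal, can be illustrated with a minimal PyTorch sketch. The paper's implementation is not published, so the sensitivity proxy used here (mean gradient magnitude) and the FakeQuantize and allocate_bits helpers are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

class FakeQuantize(nn.Module):
    """Simulates uniform symmetric quantization in the forward pass while
    letting gradients pass straight through (STE), the standard QAT trick."""
    def __init__(self, num_bits=8):
        super().__init__()
        self.num_bits = num_bits

    def forward(self, x):
        qmax = 2 ** (self.num_bits - 1) - 1
        scale = x.detach().abs().max().clamp(min=1e-8) / qmax
        q = torch.clamp(torch.round(x / scale), -qmax - 1, qmax) * scale
        # Straight-through estimator: forward uses q, backward sees identity.
        return x + (q - x).detach()

def allocate_bits(sensitivities, low=4, high=8):
    """Hypothetical sensitivity-aware allocation: layers at or above the
    median sensitivity keep the higher bit-width, the rest are compressed."""
    median = sorted(sensitivities)[len(sensitivities) // 2]
    return [high if s >= median else low for s in sensitivities]

# Toy stand-in for a transformer block stack; sensitivity is proxied here by
# mean gradient magnitude after one backward pass (an assumption, not the
# paper's stated criterion).
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
x, y = torch.randn(8, 16), torch.randint(0, 4, (8,))
nn.functional.cross_entropy(model(x), y).backward()

linears = [m for m in model if isinstance(m, nn.Linear)]
sens = [m.weight.grad.abs().mean().item() for m in linears]
bits = allocate_bits(sens)

# Apply a fake-quant op of the chosen precision to each layer's weights;
# in a full QAT loop this would run inside every forward pass.
for m, b in zip(linears, bits):
    with torch.no_grad():
        m.weight.copy_(FakeQuantize(b)(m.weight))
print({f"layer{i}": b for i, b in enumerate(bits)})
```

In a complete training loop, the bit allocation would be re-estimated as training dynamics evolve, which is what would make the scheme "dynamic" in the sense the abstract suggests.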
