Authors: Kumarswamy S, Vidya Laxman Gadekar, Manasi
Abstract: Large Language Models (LLMs) have transformed natural language processing (NLP), achieving state-of-the-art results across numerous applications. Nevertheless, their heavy computational and memory demands remain a barrier to deployment in resource-constrained environments. In this paper, we describe the development of a 1.58-bit LLM that combines quantization-aware training and tuning with low-rank adaptation (LoRA) and additional memory-efficient techniques such as Flash Attention. The proposed approach substantially reduces memory footprint and energy consumption while retaining competitive accuracy. Experimental evaluations on benchmark datasets validate the effectiveness of this approach, demonstrating its applicability to edge computing and other resource-sensitive deployments.
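The "1.58-bit" precision refers to ternary weights in {-1, 0, +1}, since log2(3) ≈ 1.58 bits of information per weight. As a minimal illustrative sketch (assuming an absmean-style scaling scheme, as popularized by ternary-weight LLM work; the paper's exact quantizer may differ):

```python
import numpy as np

def ternary_quantize(w, eps=1e-6):
    """Quantize a weight tensor to ternary codes {-1, 0, +1} with a
    per-tensor absmean scale. Dequantized weights are q * scale."""
    scale = np.mean(np.abs(w)) + eps          # absmean scale (eps avoids /0)
    q = np.clip(np.round(w / scale), -1, 1)   # ternary codes in {-1, 0, +1}
    return q, scale

# Toy example: small weights round to 0, large ones saturate to +/-1.
w = np.array([[0.4, -1.2, 0.05],
              [0.9, -0.1, -0.7]])
q, s = ternary_quantize(w)
```

Storing `q` (2 bits or less per weight after packing) plus one scale per tensor is what yields the large memory and energy savings claimed above, at the cost of quantization error that quantization-aware training is meant to absorb.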