Most neural network architectures rely heavily on matrix multiplication (MatMul) because it underlies their core operations: dense layers are built on vector-matrix multiplication (VMM), and self-attention mechanisms on matrix-matrix multiplication (MMM). This dependence is reinforced by how thoroughly GPUs are optimized for such workloads; linear algebra libraries such as cuBLAS, built on the Compute Unified Device Architecture (CUDA), allow MatMul operations to be parallelized and accelerated, greatly improving performance.
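To make this concrete, here is a minimal NumPy sketch (illustrative shapes only, not tied to any specific model) of where VMM and MMM show up in a Transformer-style block:

```python
import numpy as np

# Illustrative only: where MatMul appears in a typical Transformer block.
d_model, d_ff, seq_len = 8, 16, 4

# Dense (feed-forward) layer: a vector-matrix multiplication (VMM).
x = np.random.randn(d_model)           # one token's hidden state
W = np.random.randn(d_model, d_ff)     # dense-layer weights
h = x @ W                              # VMM: (d_model,) x (d_model, d_ff) -> (d_ff,)

# Self-attention scores: a matrix-matrix multiplication (MMM).
Q = np.random.randn(seq_len, d_model)  # queries for the whole sequence
K = np.random.randn(seq_len, d_model)  # keys for the whole sequence
scores = Q @ K.T / np.sqrt(d_model)    # MMM: (seq_len, seq_len) attention logits
```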
Matrix multiplication accounts for the majority of the computation in large language models (LLMs), and this load only grows as embedding dimensions and context lengths increase. The research discussed here shows that, even at billion-parameter scales, MatMul operations can be removed from LLMs entirely without compromising strong performance.
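The article does not spell out how the multiplications are eliminated, but one common ingredient in MatMul-free designs is constraining weights to the ternary set {-1, 0, +1}, so that each "product" collapses into an addition, a subtraction, or a skip. The sketch below is an illustrative assumption along those lines, not the authors' exact formulation:

```python
import numpy as np

# Hedged illustration (assumption, not the paper's exact method):
# with ternary weights in {-1, 0, +1}, a dense layer needs no true
# multiplications -- each weight either adds, subtracts, or skips an input.

def ternary_dense(x, W_ternary):
    """Compute x @ W_ternary using only additions and subtractions."""
    out = np.zeros(W_ternary.shape[1])
    for j in range(W_ternary.shape[1]):
        col = W_ternary[:, j]
        out[j] = x[col == 1].sum() - x[col == -1].sum()  # no multiplies
    return out

x = np.random.randn(8)
W = np.random.choice([-1, 0, 1], size=(8, 16))
assert np.allclose(ternary_dense(x, W), x @ W)  # matches the MatMul result
```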
In a recent study, researchers from the University of California, Santa Cruz, Soochow University, the University of California, Davis, and LuxiTech found that, for models up to at least 2.7 billion parameters, MatMul-free models can approach the performance of state-of-the-art Transformers, which typically require far more memory during inference. Extensive testing showed that the performance gap between MatMul-free models and conventional full-precision Transformers narrows as model size grows, suggesting that larger models do not need to rely on MatMul operations to remain accurate and efficient.
To make these models practical to deploy, the team created a GPU-efficient implementation that reduces memory usage by up to 61% during training compared to an unoptimized baseline. For inference, an optimized kernel cuts memory consumption by a factor of ten relative to unoptimized models. This sharp drop in memory use makes the models more efficient and accessible for a wider range of applications.
The team also developed a custom hardware solution on a Field-Programmable Gate Array (FPGA) to take full advantage of the models' lightweight operations, which current GPUs are not designed to exploit. The FPGA system processes billion-parameter-scale models at just 13 watts, bringing LLMs closer to the energy efficiency of the human brain.
The research shows that LLM complexity can be reduced substantially without sacrificing performance, and it points to the kinds of operations that the next generation of hardware accelerators should target in order to run lightweight LLMs. This development paves the way for LLM implementations that are more efficient, scalable, and practical.
Check out the Paper and GitHub. All credit for this research goes to the researchers of this project.
Tanya Malhotra is a final year undergrad from the University of Petroleum & Energy Studies, Dehradun, pursuing BTech in Computer Science Engineering with a specialization in Artificial Intelligence and Machine Learning.
She is a Data Science enthusiast with strong analytical and critical thinking skills, along with an ardent interest in acquiring new skills, leading groups, and managing work in an organized manner.