Researchers in the US have made a groundbreaking advance in artificial intelligence by running a high-performing large language model on an FPGA that draws about as much power as a lightbulb.
The innovative low-power AI technique, developed by researchers at the University of California, Santa Cruz, eliminates matrix multiplication, the most computationally expensive and memory-intensive element of a large language model for generative AI. The change dramatically improves energy efficiency, cutting power consumption to just 13 W for a billion-parameter model.
This development could pave the way for a new generation of low-power custom edge AI chips, especially for small language models (SLMs). These models are still transformers, an architecture known for its memory-intensive nature, and at the top end the scale is staggering: GPT-4 is estimated to have 1.76 trillion parameters.
Energy costs have long been a major challenge in running the latest LLMs behind services like ChatGPT and GPT-4 on GPUs. The UCSC research removes the computationally expensive matrix multiplication operations entirely, offering a more energy-efficient alternative.
Despite the reduced energy consumption and streamlined algorithm, the new open-source model achieves performance on par with state-of-the-art models such as Meta's Llama at the 2.7-billion-parameter scale. According to Jason Eshraghian, an assistant professor of electrical and computer engineering at the Baskin School of Engineering, "We got the same performance at way less cost — all we had to do was fundamentally change how neural networks work."
Modern neural networks rely heavily on matrix multiplication to weigh the importance of words and their relationships within sentences; the larger the matrices, the more information the network can learn. But those multiplications mean constantly shuttling data to and between GPUs, which is costly in both time and energy.
The researchers' low-power AI strategy avoids traditional matrix multiplication by forcing every number in the matrices to a ternary value of -1, 0, or +1, so computations reduce to summing rather than multiplying. By overlaying matrices and performing only the essential operations, the hardware complexity drops significantly.
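To make the idea concrete, here is a minimal sketch (our own NumPy illustration, not the researchers' code) of how a ternary weight matrix turns a matrix multiply into selective additions and subtractions:

```python
import numpy as np

# Minimal sketch of a "matmul-free" ternary linear layer. Because each
# weight is -1, 0, or +1, every multiply collapses into an add, a
# subtract, or a skip.

def ternary_linear(x: np.ndarray, w: np.ndarray) -> np.ndarray:
    """x: (batch, in_dim) activations; w: (in_dim, out_dim) ternary weights."""
    out = np.zeros((x.shape[0], w.shape[1]), dtype=x.dtype)
    for j in range(w.shape[1]):
        # Sum the activations whose weight is +1, subtract those at -1;
        # weights of 0 are simply skipped. No multiplications occur.
        out[:, j] = x[:, w[:, j] == 1].sum(axis=1) - x[:, w[:, j] == -1].sum(axis=1)
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8)).astype(np.float32)
w = rng.integers(-1, 2, size=(8, 16)).astype(np.int8)  # values in {-1, 0, 1}

# The additive version matches an ordinary matrix multiply on the same weights.
assert np.allclose(ternary_linear(x, w), x @ w.astype(np.float32), atol=1e-5)
```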
To compensate for the reduced number of operations, the researchers introduced time-based computation during training: the network keeps a kind of memory of the important information it processes, which preserves its performance.
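The article doesn't spell out the mechanism, but time-based mixing of this kind is typically an element-wise recurrence that carries a running state across tokens. The gated sketch below assumes that shape; it is our simplification, not the authors' architecture:

```python
import numpy as np

# Hedged sketch of element-wise time mixing: a per-channel "forget" gate
# blends the running state with each new token, so the network can retain
# salient information over time without ever forming an attention matrix.

def time_mix(inputs: np.ndarray, forget: np.ndarray) -> np.ndarray:
    """inputs: (seq_len, dim); forget: (dim,) gate values in (0, 1)."""
    state = np.zeros(inputs.shape[1], dtype=inputs.dtype)
    outputs = np.empty_like(inputs)
    for t, x_t in enumerate(inputs):
        state = forget * state + (1.0 - forget) * x_t  # element-wise ops only
        outputs[t] = state
    return outputs

tokens = np.random.default_rng(1).standard_normal((5, 4)).astype(np.float32)
gates = np.full(4, 0.9, dtype=np.float32)  # slow-decaying memory channels
print(time_mix(tokens, gates))
```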
The researchers first built their neural network to run on GPUs, where it achieved significantly lower memory consumption and faster operation than comparable models; with an optimized inference kernel, memory consumption falls further still.
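A back-of-the-envelope calculation (our arithmetic, not the paper's measured figures) shows why ternary weights are so much lighter: each one fits in about 2 bits instead of 16.

```python
# Rough illustration (our arithmetic, not the paper's measured figures):
# a ternary weight fits in 2 bits, versus 16 bits for an fp16 weight.
params = 1_300_000_000             # the researchers' 1.3B-parameter model
fp16_gb = params * 16 / 8 / 1e9    # ~2.6 GB of weights in fp16
ternary_gb = params * 2 / 8 / 1e9  # ~0.325 GB packed at 2 bits per weight
print(f"fp16: {fp16_gb:.2f} GB  ternary: {ternary_gb:.3f} GB  "
      f"({fp16_gb / ternary_gb:.0f}x smaller)")
```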
Collaborating with Assistant Professor Dustin Richmond and Lecturer Ethan Sifferman of the Baskin Engineering Computer Science and Engineering Department, the researchers then built custom hardware on an FPGA clocked at 60 MHz. Their implementation of the MatMul-free token generation core on an Intel FPGA DevCloud platform has shown promising results.
The core's latency is dominated by the ternary matrix multiplication (TMATMUL) functional unit. By exploiting the full 512-bit DDR4 interface and parallelizing the TMATMUL unit, the researchers project a significant speed-up without compromising the clock rate or requiring further optimization.
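To see why the width of that memory interface matters, consider a hypothetical 2-bit encoding of ternary weights (our illustration, not the actual FPGA design): each 512-bit DDR4 transfer would then deliver 256 weights to the functional unit at once.

```python
# Hypothetical packing arithmetic, not the actual RTL: ternary weights
# encoded in 2 bits each, streamed over a 512-bit DDR4 interface.
BUS_BITS = 512
BITS_PER_WEIGHT = 2                 # enough to encode {-1, 0, +1}
print(BUS_BITS // BITS_PER_WEIGHT)  # 256 weights per memory transfer
```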
The researchers' 1.3-billion-parameter model posts impressive numbers: a latency of 42 ms per token, or a throughput of 23.8 tokens per second (1 token / 0.042 s ≈ 23.8), roughly human reading speed, at a power draw comparable to that of the human brain.
Future work on the core could add caching optimizations and functional-unit enhancements to improve efficiency further. The researchers are also optimistic that custom silicon will push the energy efficiency of low-power AI further still.
While the current results are promising, the researchers acknowledge that the MatMul-free LM still needs to be tested on extremely large-scale models of over 100 billion parameters to fully assess its capabilities. The code is available on GitHub, providing transparency and opportunities for collaboration in advancing this transformative technology.
Jason Eshraghian emphasizes the vast potential of this technology, stating, "If we're able to achieve this within 13 watts, just imagine the possibilities with a data center's worth of compute power. Let's effectively utilize our resources to push the boundaries of AI innovation."
For more information and access to the code, visit www.ucsc.edu.