
AI Processor Empowers Edge Computing with 7 Billion Parameter LLMs

August 08, 2024


Advances in Generative AI have opened the door to running large language models (LLMs) at the edge. One notable milestone is the Qwen1.5-7B model running on a single Ara-2 AI processor at 12 output tokens per second. This matters because deploying LLMs and Generative AI models at the edge keeps data private and reduces latency by eliminating the need for constant Internet connectivity.
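To put the throughput figure in concrete terms, a short sketch below estimates how long a reply takes at a given decode rate. The 12 tokens/s figure comes from the article; the constant-rate assumption and the 200-token reply length are illustrative simplifications (prefill time is ignored).

```python
def response_time_seconds(output_tokens: int, tokens_per_second: float = 12.0) -> float:
    """Estimate wall-clock generation time for a reply of a given length,
    assuming a constant decode rate and ignoring prompt-processing time."""
    return output_tokens / tokens_per_second

# A typical ~200-token chat reply at the reported 12 tokens/s:
print(round(response_time_seconds(200), 1))        # 16.7 (seconds)
# At the targeted 15 tokens/s after software optimizations:
print(round(response_time_seconds(200, 15.0), 1))  # 13.3 (seconds)
```

At these rates, interactive chat and summarization remain comfortably usable on-device.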

By processing Generative AI tasks at the edge, users make a one-time investment in integrated hardware for their personal computers and avoid the recurring costs of cloud services. This approach not only enhances the functionality of PCs but also lets users perform tasks such as document summarization, transcription, and translation, all while maintaining data privacy and reducing latency.

Qwen, an open-source project licensed under Apache 2.0 and supported by Alibaba Cloud (Tongyi Qianwen), offers a range of models, including Qwen1.5-7B, designed for functions such as chat, language understanding, reasoning, math, and coding. From a Natural Language Processing (NLP) perspective, Qwen enables users to execute everyday commands on their computers efficiently. Unlike traditional voice command systems, Qwen and other Generative AI chat models are multilingual, accurate, and not restricted to a fixed set of command phrases.
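For readers curious what a chat request to such a model looks like under the hood, the sketch below builds a prompt in the ChatML-style format used by Qwen chat models. This is only an illustration of the wire format; in practice the model's tokenizer (e.g. via Hugging Face's `apply_chat_template`) assembles this string, and the example system and user messages are hypothetical.

```python
def build_chatml_prompt(system: str, user: str) -> str:
    """Assemble a ChatML-style prompt: each turn is delimited by
    <|im_start|>role ... <|im_end|>, and the string ends with an open
    assistant turn for the model to complete."""
    return (
        f"<|im_start|>system\n{system}<|im_end|>\n"
        f"<|im_start|>user\n{user}<|im_end|>\n"
        f"<|im_start|>assistant\n"
    )

prompt = build_chatml_prompt(
    "You are a helpful assistant.",
    "Summarize this document in three bullet points.",
)
print(prompt)
```

Because the format is just role-tagged text, any instruction (summarize, translate, transcribe) is expressed the same way, which is why these models are not tied to predefined command phrases.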

Running Qwen1.5-7B and other large language models at the edge depends on key capabilities of the Kinara Ara-2 processor: aggressive quantization of AI workloads, end-to-end execution of all model operators without falling back to an external host, and sufficient memory and bandwidth to handle complex neural networks. According to Wajahat Qadeer, Kinara's chief architect, achieving 12 output tokens per second on a 7B-parameter LLM is a significant accomplishment, and advanced software optimizations are planned to raise performance to 15 output tokens per second.
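To illustrate why quantization is central to fitting a 7B-parameter model on an edge device, the sketch below applies textbook symmetric per-tensor int8 quantization to a weight matrix, cutting storage 4x versus fp32 at a small reconstruction error. Kinara's actual quantization scheme is not described in the article; this is a generic example of the technique, not their implementation.

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor int8 quantization: scale floats into [-127, 127]."""
    scale = float(np.max(np.abs(w))) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights from the int8 codes."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((1024, 1024)).astype(np.float32)  # stand-in weight matrix
q, scale = quantize_int8(w)
err = float(np.abs(dequantize(q, scale) - w).mean())
print(f"int8: {q.nbytes / 1e6:.1f} MB vs fp32: {w.nbytes / 1e6:.1f} MB, "
      f"mean abs error: {err:.4f}")
```

Applied across a 7B-parameter model, the same idea shrinks weights from ~28 GB (fp32) toward ~7 GB (int8) or less, which is what makes on-device memory and bandwidth budgets workable.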

Besides its applications in Generative AI, the Ara-2 processor demonstrates versatility by efficiently handling tasks like video stream processing for object detection, recognition, and tracking on edge servers. With the ability to process multiple video streams simultaneously, Ara-2 leverages its advanced compute engines to analyze high-resolution images swiftly and accurately. Available in various form factors, including stand-alone devices, USB modules, M.2 modules, and PCIe cards, Ara-2 offers a flexible solution for diverse edge computing applications.
