
Semidynamics Tests 7bn-Parameter Model on RISC-V AI IP

June 25, 2024

Spanish RISC-V IP developer Semidynamics has benchmarked the performance of its Tensor Unit running the Llama 2 7B-parameter Large Language Model (LLM) on its 'all-in-one' RISC-V AI IP core. The test ran the full model with BF16 weights on Semidynamics' All-In-One element, using the company's ONNX Runtime Execution Provider to measure Tensor Unit utilization across all of the model's matrix-multiplication layers.
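
To make the deployment path concrete, the sketch below shows how a model is typically run through ONNX Runtime with a preferred Execution Provider. The provider name `SemidynamicsExecutionProvider` and the model file name are illustrative assumptions; the article does not publish the actual identifiers.

```python
import numpy as np
import onnxruntime as ort

# Hypothetical provider name: the actual identifier for Semidynamics'
# Execution Provider is not given in the article.
PREFERRED = ["SemidynamicsExecutionProvider", "CPUExecutionProvider"]

# Keep only providers this onnxruntime build actually registers.
providers = [p for p in PREFERRED if p in ort.get_available_providers()]

# The model file name is illustrative; any ONNX export of the model works.
sess = ort.InferenceSession("llama2-7b-bf16.onnx", providers=providers)

# Run one inference step; input names depend on how the model was exported.
input_name = sess.get_inputs()[0].name
token_ids = np.array([[1, 15043, 3186]], dtype=np.int64)  # example token ids
outputs = sess.run(None, {input_name: token_ids})
print(outputs[0].shape)
```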

The results showcase how effectively the Tensor Unit combines with the company's Gazzillion streaming data-management IP, a crucial pairing for LLMs, whose transformer networks are dominated by memory-bound operations. Semidynamics reported utilization rates exceeding 80% across a range of cases, including sparse networks and awkward matrix shapes, irrespective of matrix size, performance the company says stands in stark contrast to other architectures on the market.
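
For readers unfamiliar with the metric, utilization here means the fraction of the Tensor Unit's peak multiply-accumulate (MAC) throughput that a layer actually achieves. A minimal sketch of the arithmetic, with a purely illustrative cycle count and a hypothetical 256-MAC/cycle unit (not published Semidynamics figures):

```python
def matmul_utilization(m: int, n: int, k: int,
                       cycles: int, peak_macs_per_cycle: int) -> float:
    """Fraction of peak multiply-accumulate throughput achieved."""
    total_macs = m * n * k          # MACs in an (m x k) @ (k x n) product
    achieved = total_macs / cycles  # MACs actually retired per cycle
    return achieved / peak_macs_per_cycle

# Example: a 4096 x 4096 x 4096 layer finishing in 335M cycles on a
# hypothetical 256-MAC/cycle tensor unit -> ~80% of peak.
print(matmul_utilization(4096, 4096, 4096, 335_000_000, 256))
```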

According to Roger Espasa, CEO of Semidynamics, traditional AI designs typically combine three distinct computing elements: a CPU, a GPU (Graphics Processing Unit) and an NPU (Neural Processing Unit), interconnected via a bus. This conventional setup necessitates DMA-intensive programming, which is error-prone, slow and power-hungry. Integrating three different software stacks and architectures is a further challenge, and NPUs, being fixed-function hardware, lack the flexibility to accommodate AI algorithms that have yet to be invented.

Conversely, Semidynamics has introduced an AI architecture that consolidates the three elements into a single, scalable processing unit: a RISC-V core, a Tensor Unit responsible for matrix multiplication (filling the role of an NPU) and a Vector Unit handling activation-like computations (filling the role of a GPU). The architecture eliminates the need for DMA, adopts a unified software stack based on ONNX and RISC-V, and provides direct, zero-latency connectivity among the components. The result is higher performance, lower power consumption, better area utilization and a more developer-friendly programming environment, which together lower overall development costs.

Espasa further explained that because the flexible CPU directly controls the Tensor and Vector Units, the design can run both existing and future AI algorithms, safeguarding customer investments. The self-attention layers used in LLMs involve multiple matrix multiplications, a matrix transpose and a softmax activation function. The Tensor Unit handles the matrix multiplications, while the Vector Unit efficiently handles the transpose and softmax operations. Because the two units share vector registers, data passing between MatMul and activation layers avoids expensive memory copies, reducing both the latency and the energy cost of those transfers.
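
A minimal NumPy sketch of a single self-attention head makes that operation mix explicit; it shows which steps are matrix multiplies and which are transpose/softmax work, and is not Semidynamics' actual implementation or data layout.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    # Numerically stable softmax -- the activation the article maps
    # onto the Vector Unit.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(q, k, v):
    # Q @ K^T: a matrix multiply (Tensor Unit) on a transposed operand
    # (the transpose is handled by the Vector Unit in the mapping above).
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores)       # activation (Vector Unit)
    return weights @ v              # second matmul (Tensor Unit)

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((8, 64)) for _ in range(3))
print(self_attention(q, k, v).shape)  # (8, 64)
```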
