
SemiDynamics Unveils All-in-One RISC-V NPU

May 06, 2025


SemiDynamics, a company based in Spain, has unveiled Cervell, a Neural Processing Unit (NPU) IP that integrates CPU, vector, and tensor processing in a single architecture and delivers up to 256 TOPS. The Cervell NPU targets large language models and AI recommendation systems, making it a versatile solution for a range of applications.

A key feature of the Cervell NPU is its foundation on the open RISC-V instruction set architecture. The design scales from 8 to 64 cores, letting designers tune performance to specific application requirements: configurations range from 8 TOPS INT8 for edge deployments to 256 TOPS INT4 for high-end AI inference in datacenter chips.

Following the launch of the all-in-one architecture in December, SemiDynamics reports positive feedback on the Cervell NPU's capabilities. According to a white paper released by the company, the NPU is designed to meet the evolving needs of AI compute, offering scalable performance for edge inference, large language models, and more. Its programmability, combined with the open RISC-V ISA, gives chip designers a powerful and customizable foundation for building high-performance AI solutions.

Moreover, Cervell NPUs are specifically optimized for matrix-heavy operations, enhancing throughput, reducing power consumption, and enabling real-time responses. By integrating NPU capabilities with standard CPU and vector processing within a unified architecture, designers can achieve maximum performance across a wide range of AI tasks, from recommendation systems to deep learning pipelines.
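To see why optimizing for matrix-heavy operations pays off across these workloads, it helps to count arithmetic work: in a dense layer, multiply-accumulate (MAC) operations grow with the product of all three matrix dimensions, while element-wise steps such as bias adds or activations grow only with the output size. A small illustrative count (the layer dimensions are hypothetical, not taken from any Cervell benchmark):

```python
def matmul_macs(m: int, k: int, n: int) -> int:
    """MAC count for an (m x k) @ (k x n) dense matrix multiply."""
    return m * k * n

def elementwise_ops(m: int, n: int) -> int:
    """Op count for a per-element step (bias add, activation) on the output."""
    return m * n

# Hypothetical transformer-style projection: 512 tokens, 4096-wide layers.
macs = matmul_macs(512, 4096, 4096)
other = elementwise_ops(512, 4096)
print(macs // other)  # matmul work exceeds element-wise work by the inner dim
```

Because the ratio equals the inner dimension (here 4096), matrix units dominate the arithmetic budget of both deep learning pipelines and recommendation models, which is where a dedicated NPU path earns its throughput and power advantages.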

The tight integration of Cervell cores with the Gazillion Misses memory management subsystem is another standout feature. It supports up to 128 simultaneous memory requests, sustaining smooth data streaming at over 60 bytes/cycle. This parallel access to off-chip memory is crucial for tasks such as large-model inference and sparse data processing, keeping pipelines saturated even in bandwidth-intensive applications.
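The figure of 128 outstanding requests can be related to the sustained throughput via Little's Law: to keep B bytes/cycle flowing when each memory request returns S bytes after a latency of L cycles, roughly B·L/S requests must be in flight at once. A sketch with assumed numbers (the 128-cycle memory latency and 64-byte request size are illustrative choices, not SemiDynamics figures):

```python
import math

def requests_in_flight(bytes_per_cycle: float, latency_cycles: int,
                       request_bytes: int) -> int:
    """Little's Law: in-flight requests needed to sustain a throughput."""
    return math.ceil(bytes_per_cycle * latency_cycles / request_bytes)

# 60 B/cycle sustained (from the article), with an assumed 128-cycle
# latency and 64-byte requests -> 120 requests must be in flight,
# which fits within the 128 simultaneous requests quoted.
print(requests_in_flight(60, 128, 64))
```

The general point holds regardless of the exact latency: sustaining tens of bytes per cycle against DRAM latencies of over a hundred cycles requires on the order of a hundred outstanding misses, which is what a deep miss-handling subsystem like Gazillion Misses provides.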
