Meta has unveiled the second generation of its Meta Training and Inference Accelerator (MTIA), a custom AI chip designed to serve memory-bound large language models (LLMs) built on transformer architectures. The chip follows a broader trend among data center operators of designing their own silicon tailored to specific workloads.
The MTIA architecture balances compute power, memory bandwidth, and memory capacity, with particular attention to serving ranking and recommendation models. For inference, the chip is engineered to sustain high utilization even at relatively low batch sizes: by incorporating far more SRAM than a typical GPU, it maintains utilization when batch sizes are constrained, while still providing enough compute to handle larger amounts of concurrent work.
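The compute/memory trade-off described above can be illustrated with a simple roofline-style calculation. This is a generic sketch with illustrative bandwidth numbers, not Meta's published MTIA figures: a kernel is memory-bound when its arithmetic intensity (operations per byte moved) falls below the ratio of peak compute to memory bandwidth, which is why low-batch inference, with little data reuse, benefits from keeping data in fast on-chip SRAM.

```python
# Roofline-style check: is a workload compute- or memory-bound?
# Bandwidth figures below are illustrative assumptions, not MTIA specs;
# only the 354 TOPS dense INT8 peak comes from the article.

def attainable_tops(intensity_ops_per_byte, peak_tops, bandwidth_tbps):
    """Attainable throughput is capped by either peak compute or memory traffic."""
    return min(peak_tops, intensity_ops_per_byte * bandwidth_tbps)

PEAK_TOPS = 354.0   # dense INT8 peak (from the article)
DRAM_TB_S = 1.0     # hypothetical off-chip bandwidth, TB/s
SRAM_TB_S = 10.0    # hypothetical on-chip SRAM bandwidth, TB/s

# Small-batch, GEMV-like kernel: little reuse, ~2 ops per byte loaded.
low_intensity = 2.0
# Large-batch GEMM: heavy reuse, ~500 ops per byte.
high_intensity = 500.0

print(attainable_tops(low_intensity, PEAK_TOPS, DRAM_TB_S))   # memory-bound at 2.0 TOPS
print(attainable_tops(low_intensity, PEAK_TOPS, SRAM_TB_S))   # SRAM lifts it to 20.0 TOPS
print(attainable_tops(high_intensity, PEAK_TOPS, DRAM_TB_S))  # compute-bound at 354.0 TOPS
```

The last case shows why large SRAM matters at small batch sizes: the kernel's intensity doesn't change, but the effective bandwidth behind it does.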
The 5nm MTIA2 accelerator comprises an 8x8 grid of processing elements (PEs) and packs 2.35 billion transistors into a die measuring 25.6mm x 16.4mm (421mm²), housed in a 50mm x 40mm package. The PEs deliver 3.5x the dense compute performance of the first-generation MTIA and 7x its sparse compute performance.
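The claimed generational gains can be cross-checked against published throughput numbers. Assuming roughly 102.4 TOPS of dense INT8 compute for the first-generation MTIA (a figure from Meta's v1 announcement, not this article), the second-generation peaks quoted later in this article work out to approximately the stated 3.5x dense and 7x sparse improvements:

```python
# Cross-check the generational speedups against published throughput figures.
# MTIA v1's ~102.4 TOPS dense INT8 number is an assumption taken from Meta's
# first-generation announcement; it does not appear in this article.
MTIA1_DENSE_INT8_TOPS = 102.4
MTIA2_DENSE_INT8_TOPS = 354.0    # from this article
MTIA2_SPARSE_INT8_TOPS = 708.0   # from this article

dense_gain = MTIA2_DENSE_INT8_TOPS / MTIA1_DENSE_INT8_TOPS    # ~3.5x
sparse_gain = MTIA2_SPARSE_INT8_TOPS / MTIA1_DENSE_INT8_TOPS  # ~6.9x, i.e. ~7x

print(f"dense: {dense_gain:.1f}x, sparse: {sparse_gain:.1f}x")
```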
A key advancement in MTIA2 is its improved network-on-chip (NoC) architecture, which doubles bandwidth and enables low-latency coordination between PEs. Together with other new features inside the PEs, these enhancements underpin Meta's long-term roadmap for scaling MTIA to a broader spectrum of more complex workloads.
Running at 1.35GHz from a 0.85V supply within a 90W thermal envelope, MTIA2 delivers 708 TFLOPS (INT8) with sparsity, or 354 TFLOPS (INT8) dense. The chip is deployed in a rack-based system that accommodates up to 72 accelerators: three chassis, each containing 12 boards with two accelerators apiece.
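The rack topology is a straightforward multiplication, and the sparse peak is exactly double the dense one, consistent with hardware that skips zero operands in suitable sparse models. A quick sanity check using only numbers from the article:

```python
# Sanity-check the rack-level accelerator count, the sparse/dense ratio,
# and the aggregate rack throughput, using figures quoted in the article.
CHASSIS_PER_RACK = 3
BOARDS_PER_CHASSIS = 12
ACCELERATORS_PER_BOARD = 2

accelerators_per_rack = CHASSIS_PER_RACK * BOARDS_PER_CHASSIS * ACCELERATORS_PER_BOARD
print(accelerators_per_rack)  # 72, matching the stated capacity

DENSE_INT8_TOPS = 354.0
SPARSE_INT8_TOPS = 708.0
print(SPARSE_INT8_TOPS / DENSE_INT8_TOPS)  # 2.0: sparsity doubles peak throughput

# Aggregate dense INT8 compute for a fully populated rack.
rack_dense_tops = accelerators_per_rack * DENSE_INT8_TOPS
print(rack_dense_tops)  # 25488.0 TOPS, i.e. ~25.5 POPS per rack
```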