The latest generation of AI GPUs from Nvidia, called Blackwell, are overheating in server racks, according to a report from The Information.
Blackwell is a GPU designed for AI applications and is manufactured on foundry TSMC’s 4NP process. When arranged in server racks holding 72 of the chips, the GPUs overheat, according to The Information, which cited unnamed sources.
The Blackwell has 208 billion transistors across two full-sized, reticle-limited dies alongside coherent memory, all in a single package. At the board level, two such Blackwell GPUs sit alongside a single ARM-based Grace processor, also designed by Nvidia.
It remains unclear whether the overheating is occurring in air- or liquid-cooled racks. The Blackwell GPU has the option to be liquid-cooled. The Nvidia GB200 NVL72, introduced in March 2024, is a liquid-cooled rack-based system that comprises 36 Grace CPUs and 72 Blackwell GPUs.
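The figures quoted above are consistent with one another, as a quick calculation shows. The sketch below is illustrative only: the function and constant names are hypothetical, and it assumes the rack is built entirely from boards carrying two Blackwell GPUs and one Grace CPU, which the reported numbers imply but the report does not state outright.

```python
# Figures as quoted in the article; all names here are illustrative only.
TRANSISTORS_PER_GPU = 208_000_000_000  # 208 billion per Blackwell package
GPUS_PER_BOARD = 2                     # two Blackwell GPUs per board
GRACE_CPUS_PER_BOARD = 1               # one Grace processor per board
GPUS_PER_RACK = 72                     # GB200 NVL72 rack


def rack_composition(gpus_per_rack=GPUS_PER_RACK):
    """Derive board, CPU and transistor counts for one rack.

    Assumes every GPU in the rack sits on a two-GPU, one-CPU board.
    """
    boards, leftover = divmod(gpus_per_rack, GPUS_PER_BOARD)
    if leftover:
        raise ValueError("GPU count must be a multiple of GPUs per board")
    return {
        "boards": boards,
        "grace_cpus": boards * GRACE_CPUS_PER_BOARD,
        "gpu_transistors": gpus_per_rack * TRANSISTORS_PER_GPU,
    }


if __name__ == "__main__":
    for name, value in rack_composition().items():
        print(f"{name}: {value:,}")
```

Under that assumption, 72 GPUs work out to 36 boards and 36 Grace CPUs, matching the NVL72 description, with roughly 15 trillion GPU transistors in a single rack.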
Since then, Nvidia has asked its suppliers to change the design of the racks several times to try to resolve the overheating problems, The Information said, citing unnamed Nvidia employees.
Reuters quotes an Nvidia spokesperson saying in a statement: “Nvidia is working with leading cloud service providers as an integral part of our engineering team and process. The engineering iterations are normal and expected.”
The Grace-Blackwell combination includes extensive self-test allowing the system to swap out internal nodes while functioning.
At a launch event in March 2024, Nvidia CEO Jensen Huang said: “Every data centre hyperscaler is geared up, ODM manufacturers, every sovereign AI, telcos are ramping up with Blackwell. This will be the most successful product launch in our history.”