Nvidia goes deep into generative AI with new Blackwell GPU platform

At Nvidia‘s GTC event last week, the biggest news from the GPU (graphics processing unit) and AI (artificial intelligence) powerhouse was the launch of the Blackwell GPU architecture that it says will enable organizations to build and run real-time generative AI on trillion-parameter LLMs (large language models) at up to 25 times less cost and energy consumption than its predecessor.

Named in honor of David Harold Blackwell, a mathematician who specialized in game theory and statistics and the first Black scholar inducted into the National Academy of Sciences, the new architecture succeeds the company’s Hopper technology launched two years ago.

“For three decades, we’ve pursued accelerated computing with the goal of enabling transformative breakthroughs like deep learning and AI,” said Jensen Huang, Founder and CEO of Nvidia. “Generative AI is the defining technology of our time. Blackwell is the engine to power this new industrial revolution. Working with the most dynamic companies in the world, we will realize the promise of AI for every industry.”

Huang gave his GTC keynote on all things Nvidia to a standing-room-only crowd at Staples Center in San Jose, CA, and then later held a wide-ranging Q&A with attending media.

From general-purpose to accelerated AI computing

According to Huang, computing has reached the tipping point; general-purpose computing “has run out of steam. We need another way of doing computing so that we can continue to scale, drive down the cost of computing to consume more computing while being sustainable.”

He says that the industry is going through two transitions at one time. The first is how the computer is made, going from general-purpose to accelerating computing, and the second is what it can do—called generative AI.

“In this new world, the software is insanely complicated,” he said. “There’s nothing easy about ChatGPT, one of the greatest scientific breakthroughs ever.”

AI software is “very large,” and it’s getting larger, learning on words and images, but soon videos, using reinforcement learning, synthetic data generation, and going to get more sophisticated over time.

He’s looking for the data center to be the AI generator, producing data tokens for a new industry in “AI factories.” That was one of the reasons Nvidia created Blackwell, a new generation of computing for the “trillion parameter future.”

AI not just for inference anymore

Briefing the media at GTC, Huang said that Blackwell is revolutionary in a few ways.

One, it is designed to be very performant and energy efficient. In his keynote, he showed an example of training the same GPT (generative pre-trained transformer) for 1.8 trillion parameters in 90 days requiring only 4 MW of power instead of the previous generation’s 15 MW, saving 11 MW.

The second breakthrough is generation; beyond inference, the AI is now generating. Blackwell is designed to be a generative computer.

“This is the first time that people started to think about AI not just for inference,” he said. “In the future, images, videos, text, proteins, chemicals, kinematic control, they’re all going to be generated, and they can be generated by GPUs.”

A “generative engine” needed a special type of processor, so the company created Blackwell. Its new transformer engine and large NVLink enable generation of a lot of information very quickly by “parallelizing” many GPUs. The new architecture also allows for liquid cooling and the creation of larger 8-GPU NVLink domains.

Blackwell was designed with the same form factor as Hopper, “so you can pull out a Kopper and push in a Blackwell,” he said. “Our customer support transition map is going to be a lot easier because of this because the infrastructure already exists.”

Six transformative Blackwell technology details

Diving a little deeper, the new architecture is said to feature six transformative technologies for accelerated computing that will help unlock breakthroughs in data processing, engineering simulation, electronic design automation, computer-aided drug design, quantum computing, and generative AI—all emerging industry opportunities for the company. The technologies enable AI training and real-time LLM inference for models scaling up to 10 trillion parameters.

Packed with mind-boggling 208 billion transistors, Blackwell-architecture GPUs are manufactured using a custom-built 4NP process from strategic partner TSMC with two-reticle limit GPU dies connected by 10 TB/s chip-to-chip link into a single GPU—which the company calls the world’s most powerful chip.

Blackwell’s second-generation transformer engine—with new micro-tensor scaling support and advanced dynamic range management algorithms integrated into the company’s TensorRT-LLM and NeMo Megatron frameworks—supports double the compute and model sizes of its predecessor with new 4-bit floating point AI inference capabilities.

The latest fifth-generation iteration of the company’s NVLink delivers a groundbreaking 1.8 TB/s bidirectional throughput per GPU, ensuring high-speed communication among up to 576 GPUs for the most complex LLMs to accelerate performance for multitrillion-parameter and mixture-of-experts AI models.

Blackwell-powered GPUs feature a dedicated RAS (reliability, availability, serviceability) engine and capabilities at the chip level to use AI-based preventative maintenance to run diagnostics and forecast reliability issues. The combination maximizes system uptime and improves resiliency for massive-scale AI deployments to run uninterrupted for weeks or even months at a time while reducing operating costs.

Advanced confidential computing capabilities are said to protect AI models and customer data without compromising performance, with support for new native interface encryption protocols that are critical for privacy-sensitive industries like healthcare and financial services.

A dedicated decompression engine supports the latest formats, accelerating database queries to deliver the highest performance in data analytics and data science.

Massive superchips and NIMs

The GB200 Grace Blackwell superchip connects two B200 Tensor Core GPUs to the Grace CPU over a 900-GB/s ultra-low-power NVLink chip-to-chip interconnect. For the highest AI performance, GB200-powered systems can be connected to the company’s Quantum-X800 InfiniBand and Spectrum-X800 Ethernet platforms, both announced at GTC, that deliver networking at speeds up to 800 Gb/s.

The superchip is a key component of the GB200 NVL72, a multi-node, liquid-cooled, rack-scale system for the most compute-intensive workloads. It combines 36 Grace Blackwell superchips, which include 72 Blackwell GPUs and 36 Grace CPUs interconnected by the fifth-generation NVLink.

The GB200 NVL72 includes the company’s BlueField-3 data processing units to enable cloud network acceleration, composable storage, zero-trust security, and GPU compute elasticity in hyperscale AI clouds. The system provides up to a 30 times performance increase compared to the same number of its H100 Tensor Core GPUs for LLM inference workloads and reduces cost and energy consumption by up to 25 times.

The platform, which acts as a single GPU with 1.4 exaflops of AI performance and 30 TB of fast memory, is a building block for the newest DGX SuperPOD.

Along with Blackwell comes a new way of packaging complicated AI software with high-performance compute technology into “one container,” making it easy to download and use. The portfolio is supported by Nvidia’s AI Enterprise end-to-end operating system for production-grade AI that includes NIMs (Nvidia Inference Microservices), also announced at GTC.

Connecting the AIs together for customers are NIMs, which can be used off the shelf or customized. Nvidia will also help customers customize their own, offering the technology, expertise, and infrastructure to help companies build custom AIs like a foundry.

Cloud partners on board

Among the many organizations expected by Nvidia to adopt Blackwell are AWS, Dell Technologies, Google Cloud, Meta, Microsoft Azure, OpenAI, Oracle Cloud Infrastructure, Tesla, and xAI. Products will be available from partners starting later this year.

AWS, Google Cloud, Microsoft Azure, and Oracle Cloud Infrastructure will be among the first cloud-service providers to offer Blackwell-powered instances, as will the company’s Cloud Partner program companies Applied Digital, CoreWeave, Crusoe, IBM Cloud, and Lambda.

The GB200 will be available on Nvidia DGX Cloud, an AI platform co-engineered with leading cloud service providers that gives enterprise developers dedicated access to the infrastructure and software needed to build and deploy advanced generative AI models.

Cisco, Dell, Hewlett Packard Enterprise, Lenovo, and Supermicro are expected to deliver a range of servers based on Blackwell products, as are Aivres, ASRock Rack, ASUS, Eviden, Foxconn, GIGABYTE, Inventec, Pegatron, QCT, Wistron, Wiwynn, and ZT Systems.

Dramatic impact on simulation

The impact of accelerated computing’s speed up over general-purpose computing is dramatic for every industry Nvidia engages in, but in no industry is it more important than for using simulation tools to create products, said Huang.

“In this industry, it is not about driving down the cost of computing; it’s about driving up the scale,” he said. “We would like to be able to simulate the entire product that we do completely in full fidelity, completely digitally, in essentially what we call digital twins. We would like to design it, build it, simulate it, and operate it completely digitally—to accelerate an entire industry.

Nvidia highlighted some partners joining it to “CUDA (Compute Unified Device Architecture) accelerate” the entire ecosystem. CUDA is a parallel computing platform and API (application programming interface) that allows software to use certain types of GPUs for accelerated processing.

A growing network of software makers, including engineering simulation leaders Ansys, Cadence, and Synopsys will use Blackwell-based processors to accelerate their software for designing and simulating electrical, mechanical, and manufacturing systems and parts. Their customers can use generative AI and accelerated computing to bring products to market faster, at lower cost, and with higher energy efficiency.

Nvidia is working to CUDA accelerate Ansys engineering simulation by connecting it to the Omniverse digital-twin environment; Synopsys, its first software partner, on high-level chip design and computational lithography; and Cadence for its EDA/SDA (electronic design automation/Solomon design automation) tool digital-twin platform.

Cadence is also building a supercomputer out of Nvidia GPUs so that their customers could do CFD (computational fluid dynamics) simulation “at a thousand times scale, basically a wind tunnel in real-time,” said Huang.

“Imagine a day when tool providers [like Cadence, Synopsys, and Ansys] would offer you AI co-pilots so that we have thousands and thousands of co-pilot assistants helping us design chips and systems,” said Huang. “We’re accelerating the world’s CAE, EDA, and SDA, and we’re going to connect them all to Omniverse—the fundamental operating system for future digital twins.”

Next up for AI, the physical world

According to Huang, the next wave of AI will be better able to understand the physical world. A first step in that direction is what OpenAI has done with Sora in creating video from text.

After the AI understands the physical laws, the next step in the physical world is robotics, he said. The next generation will require new computers to run in the robot, tools like Omniverse so that the robots can learn in a digital twin, and the invention of new AI foundation models to complete the stack.