Training the largest AI models can take months on today’s computing platforms. For businesses, that is too slow.
The complexity of AI, high-performance computing, and data analytics is increasing, with some models, such as large language models, containing trillions of parameters.
The NVIDIA Hopper architecture was designed from the ground up to speed up these next-generation AI workloads by providing massive compute power and fast memory to handle growing networks and datasets.
The Transformer Engine, part of the new Hopper architecture, will significantly improve AI performance and capabilities, allowing large models to be trained in days or hours rather than weeks or months.
Training AI Models with Transformer Engine
Transformer models are the foundation of today’s language models, such as BERT and GPT-3. Originally developed for natural language processing use cases, they are increasingly being applied to computer vision, drug discovery, and other domains.
At the same time, model size continues to grow at an exponential rate, now reaching trillions of parameters. The massive amount of computation required stretches training times into months, which is impractical for business needs.
Transformer Engine combines advanced software algorithms with 16-bit floating-point precision and a newly added 8-bit floating-point data format to boost AI performance and capabilities.
AI models are trained using floating-point numbers, values with fractional components such as 3.14. The TensorFloat32 (TF32) floating-point format, introduced with the NVIDIA Ampere architecture, is now the default 32-bit format in the TensorFlow and PyTorch frameworks.
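As an illustration, the snippet below is a minimal sketch of how TF32 can be toggled in PyTorch; the two backend flags are part of PyTorch’s public API, and their default values vary by framework version.

```python
import torch

# TF32 is a Tensor Core math mode for FP32 work on Ampere and newer GPUs.
# These flags let matmuls and cuDNN convolutions use TF32 internally;
# defaults differ between PyTorch versions.
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

a = torch.randn(1024, 1024, device="cuda")
b = torch.randn(1024, 1024, device="cuda")
c = a @ b  # runs with TF32 Tensor Core math when the flags above are enabled
```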
Most AI floating-point math is done in 16-bit “half” precision (FP16) and 32-bit “single” precision (FP32), with 64-bit “double” precision (FP64) reserved for specialized operations. Transformer Engine makes it possible to train larger networks faster by reducing much of that math to just eight bits.
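The familiar FP16/FP32 recipe looks like the sketch below, which uses standard PyTorch automatic mixed precision rather than Transformer Engine’s FP8 path; the model, shapes, and hyperparameters are placeholders.

```python
import torch

# Placeholder model and data, just to show the mixed-precision training pattern.
model = torch.nn.Linear(512, 512).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()   # keeps small FP16 gradients from underflowing

inputs = torch.randn(64, 512, device="cuda")
targets = torch.randn(64, 512, device="cuda")

for _ in range(10):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():    # eligible ops run in FP16, sensitive ops stay in FP32
        loss = torch.nn.functional.mse_loss(model(inputs), targets)
    scaler.scale(loss).backward()      # backward pass on the scaled loss
    scaler.step(optimizer)             # unscale gradients, then apply the optimizer step
    scaler.update()                    # adjust the loss scale for the next iteration
```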
Combined with other new features in the Hopper architecture, such as the NVLink Switch system, which provides a direct high-speed interconnect between nodes, H100-accelerated server clusters will be able to train massive networks that were previously impossible to train at the speed enterprises require.
A Deeper Look Into Transformer Engine
Transformer Engine makes use of software and custom NVIDIA Hopper Tensor Core technology to speed up training for models based on the transformer, a common AI model building block. These Tensor Cores can use a combination of FP8 and FP16 formats to speed up AI calculations for transformers. The throughput of Tensor Core operations in FP8 is twice that of 16-bit operations.
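To get a feel for what the 8-bit format trades away, the sketch below casts a tensor to one of the FP8 storage types exposed in recent PyTorch builds (torch.float8_e4m3fn is an assumption here, requiring roughly PyTorch 2.1 or newer); each element occupies half the memory of FP16 and is rounded to a much coarser grid of representable values.

```python
import torch

x = torch.randn(4, 4)

x_fp16 = x.to(torch.float16)
x_fp8 = x.to(torch.float8_e4m3fn)   # 8-bit float: 4 exponent bits, 3 mantissa bits

print(x_fp16.element_size(), "bytes per element")   # 2
print(x_fp8.element_size(), "bytes per element")    # 1
print(x_fp8.float())   # values rounded to the nearest representable FP8 number
```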
Models must intelligently manage precision in order to maintain accuracy while gaining the performance of smaller, faster numerical formats. Transformer Engine makes this possible by using custom, NVIDIA-tuned heuristics that dynamically choose between FP8 and FP16 calculations and handle re-casting and scaling between these precisions in each layer.
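NVIDIA’s open-source Transformer Engine library exposes this per-layer FP8 handling through a Python API. The sketch below assumes its transformer_engine.pytorch module (te.Linear, fp8_autocast, and a DelayedScaling recipe), an FP8-capable GPU, and placeholder layer sizes.

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

# Placeholder dimensions for a single transformer feed-forward projection.
in_features, out_features, batch = 768, 3072, 4096

# Drop-in replacement for torch.nn.Linear backed by FP8-aware Tensor Core kernels.
model = te.Linear(in_features, out_features, bias=True)
inp = torch.randn(batch, in_features, device="cuda")

# The scaling recipe tracks per-tensor maxima and derives the scale factors
# applied when tensors are cast to FP8.
fp8_recipe = recipe.DelayedScaling(margin=0, fp8_format=recipe.Format.E4M3)

# Casting and scaling between FP8 and higher precisions are handled inside the layer.
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    out = model(inp)

loss = out.sum()
loss.backward()
```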
On fourth-generation Tensor Cores, the NVIDIA Hopper architecture triples the floating-point operations per second of the previous generation across the TF32, FP64, FP16, and INT8 precisions. Combined with Transformer Engine and fourth-generation NVLink, Hopper Tensor Cores enable an order-of-magnitude speedup for HPC and AI workloads.
Boosting AI With the Transformer Engine
Large language models, such as Megatron 530B, are at the forefront of AI research. Model size has increased sharply in recent years, and that growth is expected to continue. Many researchers are already working on trillion-parameter models for natural language understanding and other applications, demonstrating an insatiable appetite for AI compute power.
Keeping up with the growing demand for these models requires both enormous computational power and large amounts of high-speed memory. The NVIDIA H100 Tensor Core GPU delivers on both fronts, with Transformer Engine speedups taking AI training to the next level.
When these innovations are combined, they result in increased throughput and a 9x reduction in training time, from seven days to just 20 hours.
Transformer Engine can also be used for inference without any data-format conversion. Previously, INT8 was the recommended precision for optimal inference performance, but it requires trained networks to be converted to INT8 as part of the optimization process, a step the NVIDIA TensorRT inference optimizer makes simple.
With models trained in FP8, developers can skip this conversion step entirely and run inference operations at the same precision. Like INT8-formatted networks, Transformer Engine deployments have a much smaller memory footprint than higher-precision formats.
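As a rough sketch of what that looks like, again assuming the Transformer Engine Python API and a placeholder layer standing in for the trained network, inference reuses the FP8 machinery from training directly.

```python
import torch
import transformer_engine.pytorch as te

# Placeholder layer standing in for an FP8-trained network; in practice the
# trained weights would be loaded here.
layer = te.Linear(768, 3072, bias=True).eval()
inp = torch.randn(32, 768, device="cuda")

# The precision used during training is reused at inference time,
# with no separate INT8 calibration or format-conversion step.
with torch.no_grad(), te.fp8_autocast(enabled=True):
    out = layer(inp)
```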
On Megatron 530B, per-GPU inference throughput on the NVIDIA H100 is up to 30 times higher than on the NVIDIA A100 within a 1-second response latency, showcasing it as the optimal platform for AI deployments.