
Accelerating Transformers with NVIDIA cuDNN 9


The NVIDIA CUDA Deep Neural Network library (cuDNN) is a GPU-accelerated library of deep learning primitives that delivers state-of-the-art performance.

cuDNN is integrated with popular deep learning frameworks like PyTorch, TensorFlow, and XLA (Accelerated Linear Algebra). These frameworks abstract the complexities of direct GPU programming, enabling you to focus on designing and training your models rather than worrying about the underlying hardware. cuDNN serves as the performance engine under the hood, ensuring that operations on these frameworks are executed with maximum efficiency.

More recently, scaled dot product attention (SDPA) has emerged as a performance-critical primitive in important workloads like large language models (LLMs). cuDNN added support for this primitive and has been improving its performance release-over-release using flash attention and other optimizations while expanding the functional support surface to enable a range of attention use cases. 
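
As a refresher, SDPA computes softmax(QK^T / sqrt(d)) V over query, key, and value tensors Q, K, and V with head dimension d, optionally applying a mask (such as a causal mask) to the QK^T scores before the softmax.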

On the NVIDIA H200 Tensor Core GPU, cuDNN can achieve up to 1.2 PFLOPS in FP8. As an end-to-end example, our team measured a 1.15x speedup after enabling cuDNN FP8 SDPA for Llama2 70B LoRA fine-tuning. This experiment used NVIDIA NeMo with NVIDIA Transformer Engine (TE) on an 8-GPU H200 node.

Software stack                      | e2e speedup
NeMo + TE with cuDNN disabled       | 1x (baseline)
NeMo + TE with cuDNN SDPA in BF16   | 1.11x
NeMo + TE with cuDNN SDPA in FP8    | 1.15x
Table 1. Impact of using cuDNN for SDPA as part of an end-to-end training run (Llama2 70B LoRA fine-tuning) on an 8-GPU H200 node

In this post, I present more details on the achievable performance with cuDNN SDPA, walk through how to use it, and briefly summarize some other notable new features in cuDNN 9.

Scaled dot product attention performance

NVIDIA began supporting attention by open-sourcing the fused Multihead Attention (fMHA) kernel in the APEX library, which fuses the attention algorithm into a single kernel. Tri Dao’s innovative work used this kernel as a starting point, delivering massive performance improvements and functionality in the form of flash attention. For more information, see /Dao-AILab/flash-attention on GitHub and the FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning paper.

NVIDIA then advanced the state of the art in fused attention with a faster and more flexible implementation. This implementation is now the default fused attention backend in NVIDIA TE for NVIDIA Hopper architecture GPUs. 

XLA provides a path to cuDNN SDPA today: you can access it either through the JAX SDPA API or by relying on the XLA compiler to lower a customized implementation in JAX/PyTorch to cuDNN SDPA.
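
As a rough illustration of the JAX path, here is a minimal sketch. It assumes a recent JAX release that exposes jax.nn.dot_product_attention with a "cudnn" implementation option, so check your JAX version for the exact API surface; the shapes are illustrative.

import jax
import jax.numpy as jnp

b, s, h, d = 2, 4096, 16, 128  # batch, sequence length, heads, head dimension
kq, kk, kv = jax.random.split(jax.random.PRNGKey(0), 3)
q = jax.random.normal(kq, (b, s, h, d), dtype=jnp.bfloat16)
k = jax.random.normal(kk, (b, s, h, d), dtype=jnp.bfloat16)
v = jax.random.normal(kv, (b, s, h, d), dtype=jnp.bfloat16)

# Request the cuDNN SDPA backend explicitly; implementation=None lets XLA choose.
out = jax.nn.dot_product_attention(q, k, v, is_causal=True, implementation="cudnn")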

PyTorch eager mode SDPA doesn’t use cuDNN today, but a cuDNN-based implementation is in progress. For more information, see the PyTorch PRs for Fprop and Bprop.

The cuDNN SDPA implementation encapsulates the following:

  • Deep understanding of the underlying NVIDIA hardware architectures
  • Implementations of all state-of-the-art SDPA algorithms, from non-flash to flash attention v2 and everything in between
  • Heuristics that automatically set performance knobs (such as tile size) based on the problem size and target GPU architecture

This leads to the best available performance on NVIDIA GPUs. Figures 1 and 2 show that across a wide variety of use cases, cuDNN 9 BF16 is up to 2x faster than the best available PyTorch eager implementation, also in BF16. The cuDNN FP8 implementation is up to 3x faster.

Better performance enables longer sequence lengths and shorter pretraining and fine-tuning time for the models. In line with other public benchmarks (such as flash-attention/benchmarks), this post reports GPU time only and does not include host overhead.
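
If you want to make a comparable measurement yourself, the following minimal sketch times GPU work only using CUDA events, with the PyTorch eager SDPA as the workload; the shapes are illustrative, not the exact benchmark configuration used for the figures.

import torch
import torch.nn.functional as F

b, h, s, d = 2, 16, 4096, 128  # batch, heads, sequence length, head dimension
q, k, v = (torch.randn(b, h, s, d, device="cuda", dtype=torch.bfloat16) for _ in range(3))

start = torch.cuda.Event(enable_timing=True)
stop = torch.cuda.Event(enable_timing=True)

for _ in range(3):  # warm up
    F.scaled_dot_product_attention(q, k, v, is_causal=True)
torch.cuda.synchronize()

start.record()
for _ in range(10):
    F.scaled_dot_product_attention(q, k, v, is_causal=True)
stop.record()
torch.cuda.synchronize()

# elapsed_time returns the time between the two recorded events in milliseconds,
# as measured on the GPU rather than on the host.
print(f"avg GPU time per iteration: {start.elapsed_time(stop) / 10:.3f} ms")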

Figure 1. SDPA (forward only) with causal mask, head dimension 128
Figure 2. SDPA (forward plus backward) with causal mask, head dimension 128
Figure 3. SDPA (forward only) with no causal mask, head dimension 256

SDPA as cuDNN graphs

SDPA in cuDNN can be specified as a cuDNN graph of tensor operations. For a given graph, the cuDNN library has some set of engines that can execute it. 

While some graphs may not have any suitable engines, the intent is to provide at least one engine for any cuDNN graph that’s practical to execute atomically on the GPU, usually in one fused kernel but sometimes in a small set of cooperative kernels. You can think of this as a “subgraph” of the overall framework graph.

SDPA (in its various forms) is an ideal example of such a subgraph. The engines supporting these patterns are designed to be as flexible as possible without trading off significant performance, using the best available algorithm, like flash attention. This means that you can make variations to an attention computation and still run it efficiently on the GPU. This section aims to explain the flexibility and support available today.

Figure 4 shows the cuDNN graph for the forward propagation (fprop) use case. This is the sequence of operations involved in the forward computation of an attention mechanism, SDPA, within a neural network.

Figure 4. Fine-grained fprop computation cuDNN graph for scaled dot product attention

The causal mask logic lowers to the individual gen index, pointwise, and selection ops. The softmax logic lowers to a half-dozen cuDNN ops (Figure 4).  This granularity of cuDNN ops gives you the flexibility to express customized models. Figure 5 shows the possible combinations. 

Figure 5. cuDNN attention support

Figure 3 shows that with a larger head dimension (256) and no causal mask, cuDNN FP8 forward flash attention can achieve up to 1.2 PFLOPS.

You can also construct a custom graph with arbitrary pointwise operations between BMM1 and Softmax. This flexibility enables support for new variants that haven’t been discovered yet. 

When creative researchers tinker with canonical attention patterns, they’re less likely to hit performance cliffs from falling back to less-optimized framework implementations. Note that Figure 2 covers the forward plus backward computation needed for full training runs; similar flexibility is available for the backprop attention graph.

SDPA usage walkthrough

There are several API entry points available for creating and running cuDNN graphs:

  • Frontend API (with both C++ and Python variants)
  • Backend API (C only)

All the cuDNN graph concepts described in the previous section apply to both levels of API. However, the cuDNN team recommends that you interface with the cuDNN Frontend API in either Python or C++, unless you need a C interface. The frontend API is significantly more concise and adds several conveniences.

For example, the frontend API extends the concept of an operation node to enable nodes that encapsulate multiple operations and dataflow between them. In other words, they are convenience nodes that abstract away the details of common graph patterns, such as SDPA. The nodes still enable flexibility to configure well-known variants. This SDPA usage walkthrough begins with the simplest case, an SDPA node created in Python.

The SDPA Python example in the frontend repo demonstrates the configuration options and the basic usage flow:

  1. Initialize a cudnn.pygraph object with the appropriate data types. 
  2. Create Tensor objects that contain the dimensions, layout, and data type of the tensors. 
  3. Create the scaled dot product flash attention node and provide the required configurations. 
  4. Build the graph and provide the device pointers. 
  5. Execute the graph.  
# 1. cuDNN graph
graph = cudnn.pygraph(
          io_data_type=cudnn.data_type.BFLOAT16,
          intermediate_data_type=cudnn.data_type.FLOAT,
          compute_data_type=cudnn.data_type.FLOAT,
        )

# 2. tensor attributes
q = make_tensor_attr(graph, q_gpu, "q")
k = make_tensor_attr(graph, k_gpu, "k")
v = make_tensor_attr(graph, v_gpu, "v")
attn_scale = make_tensor_attr(graph, attn_scale_cpu, "attn_scale", is_pass_by_value=True)

# 3. sdpa node with the configurations
o, stats = graph.scaled_dot_product_flash_attention(
                    name="scaled_dot_product_flash_attention",
                    q=q,
                    k=k,
                    v=v,
                    is_inference=False,
                    attn_scale=attn_scale,
                    bias=None,
                    use_alibi_mask=False,
                    use_padding_mask=False,
                    seq_len_q=None,
                    seq_len_kv=None,
                    use_causal_mask=True,
                    dropout=None,
                )

# 4. build the graph and provide the device pointers
graph.build()
variant_pack = {
                 q: q_gpu,
                 k: k_gpu,
                 v: v_gpu,
                 o: o_gpu,
                 stats: stats_gpu,  # stats output buffer is needed because is_inference=False
               }

# 5. allocate a workspace and execute the graph
workspace = torch.empty(graph.get_workspace_size(), device="cuda", dtype=torch.uint8)
graph.execute(variant_pack, workspace)

If you want more flexibility than what the SDPA node offers by default, the code that constructs the underlying SDPA graph is open source, so it can be customized. 

For example, the following code customizes the scale node within the SDPA node. For more information, see scaled_dot_product_flash_attention.h.

// Optional scale
if (options.inputs.Attn_scale) {
      // Lower options to scale options
      auto attn_scale_output = std::make_shared<Tensor_attributes>();
      attn_scale_output->set_is_virtual(true);

      Pointwise_attributes scale_attributes;
      scale_attributes.set_name("attn_scale");
      scale_attributes.set_mode(PointwiseMode_t::MUL);
      scale_attributes.inputs.IN_0 = last_output;
      scale_attributes.inputs.IN_1 = options.inputs.Attn_scale;
      last_output = scale_attributes.outputs.OUT_0 = attn_scale_output;
      auto scale_node = std::make_unique<PointwiseNode>(std::move(scale_attributes), context);
      
      sub_nodes.emplace_back(std::move(scale_node));
}

For more information about how to accelerate your own custom transformers, see the NVIDIA cuDNN documentation.

Other notable cuDNN 9 features

In addition to the SDPA improvements, cuDNN 9 introduces several other important improvements:

  • Mixed input precision support for matmuls and convolutions
  • Improved error reporting
  • Hardware forward compatibility
  • Streamlined installation

Mixed input precision support for matmuls and convolutions 

Matrix multiplication (matmul) and convolution APIs that require both input operands to have the same data type (for example, FP16 or FP32) are not well suited to cases like Activation-aware Weight Quantization (AWQ), where the activations are in FP16 and the weights may be in INT8. Assuming that you want to compute at the higher precision, you must cast the data; if the cast is not done online, it requires additional memory on top of the conversion cost.

cuDNN now supports mixed input precision matmuls and convolutions, where the A and B operands can have different data types, with online fused type conversion for performance and memory savings. You can choose which operand is cast to the other operand’s data type, and cuDNN handles the required conversion in optimized kernels.
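
As a rough illustration, here is a minimal sketch of a mixed input precision matmul expressed with the Python frontend. It assumes that your cuDNN 9 build accepts operands of different data types directly on the matmul node (FP16 activations and INT8 weights in this case); depending on the version, the conversion may instead need to be expressed as an explicit cast node that the engine then fuses, and the shapes and output data type here are illustrative.

import cudnn
import torch

m, k, n = 1024, 4096, 4096
a_gpu = torch.randn(1, m, k, device="cuda", dtype=torch.float16)              # activations
b_gpu = torch.randint(-128, 128, (1, k, n), device="cuda", dtype=torch.int8)  # quantized weights
c_gpu = torch.empty(1, m, n, device="cuda", dtype=torch.float16)

graph = cudnn.pygraph(
          intermediate_data_type=cudnn.data_type.FLOAT,
          compute_data_type=cudnn.data_type.FLOAT,
        )
a = graph.tensor(name="A", dim=a_gpu.size(), stride=a_gpu.stride(),
                 data_type=cudnn.data_type.HALF)
b = graph.tensor(name="B", dim=b_gpu.size(), stride=b_gpu.stride(),
                 data_type=cudnn.data_type.INT8)

# The INT8 operand is upconverted online inside the fused kernel, so no
# separate cast kernel or extra FP16 copy of B is materialized.
c = graph.matmul(name="mixed_matmul", A=a, B=b)
c.set_output(True).set_data_type(cudnn.data_type.HALF)

graph.build()
workspace = torch.empty(graph.get_workspace_size(), device="cuda", dtype=torch.uint8)
graph.execute({a: a_gpu, b: b_gpu, c: c_gpu}, workspace)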

Figure 6 shows speedups of cuDNN mixed input precision matmuls over an unfused workflow. The gray bars are cases where inputs A and B are in FP16 and INT8 precision, respectively, with A converted to INT8, followed by an INT8xINT8 matrix multiplication with INT32 accumulation. The green bars show cases where B is upconverted from INT8 to FP16, followed by an FP16xFP16 matrix multiplication with FP32 accumulation.

Figure 6. Support for mixed input precision matmuls and convolutions in cuDNN

Improved error reporting

Logging is essential in software development, especially for complex systems such as deep learning frameworks that use cuDNN. In the past, a common cuDNN pain point was the difficulty of debugging errors and warnings. We’ve been continuously improving our error reporting to address this. 

cuDNN 9 adds to this work with the following:

  • More specific error codes
  • Categorization to help organize the increased number of error codes
  • Nested logging levels to align with logging conventions
  • cudnnGetLastErrorString, a new function to get the last error message programmatically

For more information, see the Error and API Logging section of the developer guide.

Hardware forward compatibility

Before version 9.0.0, the cuDNN library supported up to the latest publicly available GPU architecture at the time of the library’s release. For example, cuDNN 8.9.x supported up through NVIDIA Hopper (that is, compute capability 9.0), and running 8.9.x on a newer GPU architecture is not supported.

However, cuDNN 9 has hardware forward compatibility for a large subset of the API. This means that programs only using this subset of the API with cuDNN v9 will be functional on future GPUs, and users of these programs won’t be forced to upgrade their cuDNN installation to use a future GPU. 

When running on a GPU with compute capability greater than 9.0, instead of erroring out, the library finds a functionally equivalent implementation and uses PTX JIT to target the new architecture.

For more information about support limitations and best practices, see the Hardware Forward Compatibility section of the developer guide.

Streamlined installation

In Python environments, you can now use pip to install the new Python frontend in addition to the library:

# cuDNN library for CUDA 12
pip install nvidia-cudnn-cu12 

# cuDNN frontend
pip install git+https://github.com/NVIDIA/cudnn-frontend.git
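
To confirm that the Python frontend can see the library after installation, you can import it and print the backend version. The cudnn.backend_version() call assumed here returns an integer version code (for example, 90100 for cuDNN 9.1.0).

import cudnn

# Prints the version of the cuDNN backend library that the frontend loaded.
print(cudnn.backend_version())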

cuDNN 9 also streamlines the installation process for RPM and Debian meta-packages. For example, the following commands install cuDNN on Ubuntu 22.04:

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
sudo apt-get install -y cudnn

This new flow simplifies the keyring setup as a Debian package install and abstracts away the exact cuDNN library version with the cudnn meta-package.

For more information and complete instructions, see the cuDNN Installation Guide.

Next steps

If you have feedback, questions, or comments, post on the cuDNN NVIDIA Developer forum (for cuDNN library topics) or in the NVIDIA/cudnn-frontend repo on GitHub (for frontend topics). Download cuDNN to get started.

Stay tuned for more new cuDNN capabilities in the future. The progress outlined in this post is an important milestone, but much more is to come. 

As AI continues to drive the industry to the limits of hardware and software integration, NVIDIA continues to optimize performance and improve user experience, so that cuDNN can be used more effectively and broadly across deep learning frameworks and graph compilers.

Acknowledgments

The NVIDIA cuDNN team contributed the technical content for this post, in collaboration with many other teams across the company.
