NVSHMEM


NVSHMEM™ is a parallel programming interface based on OpenSHMEM that provides efficient and scalable communication for NVIDIA GPU clusters. NVSHMEM creates a global address space for data that spans the memory of multiple GPUs and can be accessed with fine-grained GPU-initiated operations, CPU-initiated operations, and operations on CUDA® streams.
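
As a minimal sketch of this model (names and launch configuration are illustrative, not taken from a specific NVSHMEM sample), each PE allocates a buffer on the symmetric heap and a GPU thread writes directly into a peer's partition with a one-sided put:

    #include <cuda_runtime.h>
    #include <nvshmem.h>
    #include <nvshmemx.h>

    __global__ void put_to_neighbor(int *dst) {
        int mype = nvshmem_my_pe();
        int peer = (mype + 1) % nvshmem_n_pes();
        // GPU-initiated, one-sided write into the peer's partition of the symmetric heap
        nvshmem_int_p(dst, mype, peer);
    }

    int main(void) {
        nvshmem_init();
        cudaSetDevice(nvshmem_team_my_pe(NVSHMEMX_TEAM_NODE));  // one GPU per PE on each node
        int *dst = (int *) nvshmem_malloc(sizeof(int));         // symmetric allocation on every PE
        put_to_neighbor<<<1, 1>>>(dst);
        nvshmemx_barrier_all_on_stream(0);   // complete the put and synchronize PEs on the stream
        cudaStreamSynchronize(0);
        nvshmem_free(dst);
        nvshmem_finalize();
        return 0;
    }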


Get Started

Existing communication models, such as Message-Passing Interface (MPI), orchestrate data transfers using the CPU. In contrast, NVSHMEM uses asynchronous, GPU-initiated data transfers, eliminating synchronization overheads between the CPU and the GPU.

Efficient, Strong Scaling

NVSHMEM enables long-running kernels that include both communication and computation, reducing overheads that can limit an application’s performance when strong scaling.
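
As a hedged sketch (kernel and variable names are illustrative), a single long-running kernel can alternate local computation with device-initiated communication, avoiding per-iteration kernel launches and host synchronization; device-side barriers assume the kernel is launched with nvshmemx_collective_launch:

    // Illustrative single-block kernel: each PE computes, pushes its boundary data to a
    // neighbor, and all PEs synchronize on the device before the next iteration.
    __global__ void fused_iterations(float *halo, const float *boundary,
                                     size_t halo_elems, int iters, int peer) {
        for (int it = 0; it < iters; ++it) {
            // ... compute on this PE's partition (omitted) ...
            __syncthreads();
            if (threadIdx.x == 0)
                nvshmem_float_put(halo, boundary, halo_elems, peer);  // device-initiated put
            nvshmemx_barrier_all_block();  // block-scoped barrier across all PEs
        }
    }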

Low Overhead

One-sided communication primitives reduce overhead by allowing the initiating process or GPU thread to specify all information required to complete a data transfer. This low-overhead model enables many GPU threads to communicate efficiently.
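
For example (a sketch with illustrative names), each GPU thread can issue its own fine-grained put, naming the symmetric destination, the value, and the target PE, with no matching receive posted on the remote side:

    __global__ void scatter_to_peer(float *dst, const float *src, int n, int peer) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            // The initiating thread supplies everything the transfer needs:
            // symmetric destination address, value, and target PE.
            nvshmem_float_p(&dst[i], src[i], peer);
    }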

Naturally Asynchronous

Asynchronous communications make it easier for programmers to interleave computation and communication, thereby increasing overall application performance.
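
A sketch of this overlap pattern (names are illustrative): a non-blocking put is issued, independent computation proceeds while the transfer is in flight, and nvshmem_quiet() ensures completion before the data is reused or a peer is signaled:

    __global__ void overlap_comm_compute(float *remote, const float *boundary,
                                         size_t nhalo, int peer) {
        if (blockIdx.x == 0 && threadIdx.x == 0)
            nvshmem_float_put_nbi(remote, boundary, nhalo, peer);  // returns without waiting

        // ... computation that does not depend on the transfer proceeds here (omitted) ...

        if (blockIdx.x == 0 && threadIdx.x == 0)
            nvshmem_quiet();  // all outstanding puts from this PE are now complete
    }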



What's New in NVSHMEM 3.0

  • Added support for multi-node systems that use RDMA networks (IB, RoCE, Slingshot, and so on) and NVIDIA NVLink® as multi-node interconnects.
  • Added support for ABI backward compatibility between the host and device libraries. Within the same NVSHMEM major version, a newer host library remains compatible with older device library versions. This work involved minimizing the ABI surface between the host and device libraries and versioning the structs and functions that form the new ABI surface.
  • Enhanced NVSHMEM’s memory management infrastructure with an object-oriented programming (OOP) framework that uses multi-level inheritance to manage the supported device memory types (STATIC, DYNAMIC) and to enable support for newer memory types in the future.
  • Added support for PTX testing.
  • Added support for CPU-assisted IBGDA via a NIC handler that manages the NIC doorbell. The NIC handler can be selected through the new environment variable NVSHMEM_IBGDA_NIC_HANDLER. This feature enables IBGDA adoption on systems that do not have the PeerMappingOverride=1 driver setting.
  • Improved IBGDA setup performance by 20-50% when scaling up the number of PEs, by batching and minimizing memory registration invocations for IB control structures.
  • Enhanced support to compose NVSHMEM_TEAM_SHARED on Multi-node NVLink (MNNVL)-based systems.
  • Improved performance for block-scoped reductions by parallelizing the send/recv of data for small message sizes. Additionally, NVSHMEM device code compiled with NVIDIA CUDA® 11.0 and -std=c++17 automatically uses cooperative-groups reduction APIs to improve the performance of local reductions.
  • Updated IBGDA to automatically prefer RC over DC connected QPs and changed the default values of NVSHMEM_IBGDA_NUM_RC_PER_PE/NVSHMEM_IBGDA_NUM_DCI to 1.
  • Added assertions in the DEVX and IBGDA transports to check for extended-atomics support in RDMA NICs.
  • Made nvshmem_malloc/calloc/align/free skip collective synchronization, following OpenSHMEM spec-compliant behavior, when the requested size is 0 or the buffer in the heap is NULL, respectively (see the sketch after this list).
  • Added support for the nvshmemx_fcollectmem/broadcastmem device and on-stream interfaces.
  • Improved performance tracing in the on-stream and host collectives performance benchmarks by using cudaEventElapsedTime instead of the gettimeofday API.
  • Added the bootstrap_coll performance benchmark for various bootstrap modalities.
  • Added support for the “Include-What-You-Use” (IWYU) framework in the CMake build system.
  • Fixed several minor bugs and memory leaks.
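
To illustrate the spec-compliant allocation behavior noted above (a minimal sketch; NVSHMEM initialization is omitted), a zero-byte request returns NULL, freeing NULL is a no-op, and neither call forces a collective synchronization across PEs:

    #include <assert.h>
    #include <nvshmem.h>

    void zero_size_example(void) {
        void *p = nvshmem_malloc(0);  // returns NULL; no barrier across PEs
        assert(p == NULL);
        nvshmem_free(NULL);           // no-op; no barrier across PEs
    }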


Key Features


  • Combines the memory of multiple GPUs into a partitioned global address space that’s accessed through NVSHMEM APIs
  • Includes a low-overhead, in-kernel communication API for use by GPU threads
  • Includes stream-based and CPU-initiated communication APIs
  • Supports x86 and Arm processors
  • Is interoperable with MPI and other OpenSHMEM implementations (see the sketch below)
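
For example, MPI interoperability typically looks like the following sketch, in which NVSHMEM is bootstrapped on top of an existing MPI communicator:

    #include <mpi.h>
    #include <nvshmem.h>
    #include <nvshmemx.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        // Bootstrap NVSHMEM on top of the existing MPI communicator
        nvshmemx_init_attr_t attr;
        MPI_Comm comm = MPI_COMM_WORLD;
        attr.mpi_comm = &comm;
        nvshmemx_init_attr(NVSHMEMX_INIT_WITH_MPI_COMM, &attr);

        // ... MPI and NVSHMEM calls can now be mixed in the same application ...

        nvshmem_finalize();
        MPI_Finalize();
        return 0;
    }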


NVSHMEM Advantages


Increase Performance

Convolution is a compute-intensive kernel that’s used in a wide variety of applications, including image processing, machine learning, and scientific computing. Spatial parallelization decomposes the domain into sub-partitions that are distributed over multiple GPUs with nearest-neighbor communications, often referred to as halo exchanges.

In the Livermore Big Artificial Neural Network (LBANN) deep learning framework, spatial-parallel convolution is implemented using several communication methods, including MPI and NVSHMEM. The MPI-based halo exchange uses the standard send and receive primitives, whereas the NVSHMEM-based implementation uses one-sided put, yielding significant performance improvements on Lawrence Livermore National Laboratory’s Sierra supercomputer.
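
A hedged sketch of such a halo exchange using the stream-based API (the buffer names and 1-D decomposition are illustrative, not taken from LBANN): each PE pushes its boundary planes directly into its neighbors' halo regions with one-sided puts, then all PEs synchronize on the stream.

    #include <cuda_runtime.h>
    #include <nvshmem.h>
    #include <nvshmemx.h>

    // halo_lo/halo_hi are symmetric receive regions; bnd_lo/bnd_hi are this PE's boundary planes.
    void halo_exchange(float *halo_lo, float *halo_hi,
                       const float *bnd_lo, const float *bnd_hi,
                       size_t halo_elems, cudaStream_t stream) {
        int mype  = nvshmem_my_pe();
        int npes  = nvshmem_n_pes();
        int left  = (mype - 1 + npes) % npes;
        int right = (mype + 1) % npes;

        // One-sided puts: no matching receives are posted on the neighbors.
        nvshmemx_float_put_on_stream(halo_hi, bnd_lo, halo_elems, left,  stream);
        nvshmemx_float_put_on_stream(halo_lo, bnd_hi, halo_elems, right, stream);

        // Complete the puts and synchronize all PEs before the halo data is consumed.
        nvshmemx_barrier_all_on_stream(stream);
    }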


Efficient Strong-Scaling on Sierra Supercomputer



Efficient Strong-Scaling on NVIDIA DGX SuperPOD

Accelerate Time to Solution

Reducing the time to solution for high-performance, scientific computing workloads generally requires a strong-scalable application. QUDA is a library for lattice quantum chromodynamics (QCD) on GPUs, and it’s used by the popular MIMD Lattice Computation (MILC) and Chroma codes.

NVSHMEM-enabled QUDA avoids CPU-GPU synchronization for communication, thereby reducing critical-path latencies and significantly improving strong-scaling efficiency.

Watch the GTC 2020 Talk



Simplify Development

The conjugate gradient (CG) method is a popular numerical approach to solving systems of linear equations, and CGSolve is an implementation of this method in the Kokkos programming model. The CGSolve kernel showcases the use of NVSHMEM as a building block for higher-level programming models like Kokkos.

NVSHMEM enables efficient multi-node and multi-GPU execution using Kokkos global array data structures without requiring explicit code for communication between GPUs. As a result, NVSHMEM-enabled Kokkos significantly simplifies development compared to using MPI and CUDA.


Productive Programming of Kokkos CGSolve


Ready to start developing with NVSHMEM?


Get Started