nvshmem/changelog at devel · NVIDIA/nvshmem · GitHub

===============================================================================
                               Changes in 3.6.5
===============================================================================

# Added configuration file support (similar to NCCL) for easy and repeatable environment variable management.
# Added experimental NVSHMEM LTO-IR (Link-Time Optimization IR) library build option for improved device code optimization.
# Added enhanced user buffer registration with preferred address support via `nvshmemx_buffer_register_symmetric`.
# Added error code return values for tile API calls to improve error handling.
# Added multi-NIC support for `libfabric` transport with round-robin NIC selection. The new environment variable `NVSHMEM_LIBFABRIC_MAX_NIC_PER_PE` controls the maximum number of NICs used per PE.
# Improved version mismatch error messages to include detailed host and device library version information.
# Fixed IBGDA to activate CST optimizations on supported architectures.
# Fixed LLVM Dead Store Elimination removing WQE stores in bitcode library builds.
# Fixed `NVSHMEM_TEAM_SHARED` initialization for imbalanced P2P-connected groups of PEs in MNNVL setups.
# Fixed `libfabric` transport compatibility issues between build and runtime versions.
# Fixed PMIx bootstrap to handle empty environment variables correctly.
# Fixed host RMA off-stream transport capability check.
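
The round-robin NIC selection noted above can be pictured with a small sketch. It is an illustration only, not the libfabric transport's code: the helper name `select_nics`, the placeholder NIC names, the fallback of 1, and the choice to rotate the starting NIC by local PE index are all assumptions. Only the `NVSHMEM_LIBFABRIC_MAX_NIC_PER_PE` variable name comes from the entry.

```python
import os

def select_nics(local_pe, nics, max_nic_per_pe):
    """Pick up to max_nic_per_pe NICs for a PE, rotating the start by local PE index.

    Hypothetical helper: the real transport's selection policy may differ.
    """
    if not nics or max_nic_per_pe < 1:
        return []
    count = min(max_nic_per_pe, len(nics))
    start = (local_pe * count) % len(nics)
    return [nics[(start + i) % len(nics)] for i in range(count)]

# Read the cap from the environment; the fallback of 1 is an assumption of this sketch.
max_nics = int(os.environ.get("NVSHMEM_LIBFABRIC_MAX_NIC_PER_PE", "1"))
nics = ["nic0", "nic1", "nic2", "nic3"]  # placeholder device names
for pe in range(4):
    print(pe, select_nics(pe, nics, max_nics))
```

With four NICs and a cap of 2, local PEs 0 and 1 would receive disjoint pairs under this policy, and PE 2 would wrap around to the first pair.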

NVSHMEM4Py

# Added CuTe DSL support for NVSHMEM4Py with device-side bindings for RMA, collective, AMO, and memory operations.
# Added device-side construction of peer and multicast tensors for CuTe DSL.
# Added helper functions to simplify NVSHMEM/CuTe DSL usage.
# Made Numba-CUDA an optional dependency and bumped minimum version to 0.25.
# Fixed peer/multimem buffer tracking assumptions for parent buffer cleanup.

===============================================================================
                               Changes in 3.5.19
===============================================================================

# Added tile-granular RMA routines ``tile_put``, ``tile_get``, and ``tile_broadcast`` APIs
# Added CUTLASS support to tile API
# Added qpair-specific APIs (``nvshmemx_qp_*``) that provide RMA operations on specific queue pairs abstracted via ``nvshmemx_qp_handle_t``
# Added LLVM bitcode library support for IBGDA
# Added option to pass the CUDA device to ``nvshmemx_init_attr`` to set a different device when using NVSHMEM
# Added environment variable ``NVSHMEM_MAX_PEER_STREAMS`` to set the maximum number of CUDA streams per node
# Renamed ``tile_allreduce`` to ``tile_reduce`` and ``tile_reduce`` to ``tile_rooted_reduce`` to align with other NVSHMEM collectives
# Removed the static-only library `libnvshmem.a`; users should instead link to `libnvshmem_host` and `libnvshmem_device`
# Removed `realloc` and `alltoalls` declarations because these functions are not implemented in NVSHMEM
# Improved EFA transport (libfabric) with multiple bug fixes and performance improvements for AWS environments
# Updated number of used QPs from 4 to 8 for full bandwidth with data direct
# Changed default ``NVSHMEM_MAX_MEMORY_PER_GPU`` from 128 GiB to 256 GiB
# Improved ``NVSHMEM_HCA_PREFIX`` to accept ``^`` and updated default value
# Refactored team duplication to enable CUDA graph capture
# Changed to use 1 CUDA stream when NVSwitch is present
# Updated hydra installation script to install version 4.3.2
# Improved error catching and reporting for initialization and synchronization routines
# Fixed race condition in team mapping kernel waiting for peer pe_in_team initialization
# Fixed race condition in barrier causing hangs on unordered networks
# Fixed reduce validation issues in cases when PE count was not a power of 2
# Fixed stream memory operations to use ``cuStreamWriteValue`` only for self-writes
# Fixed ``nvshmem_calloc`` not taking into account the ``count`` argument
# Fixed several minor bugs and memory leaks
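
Because ``tile_reduce`` is both an old name and a new name in the rename above, a naive two-step search-and-replace can rename the same call twice. The sketch below is a hypothetical migration helper (`migrate_tile_names` is not part of NVSHMEM) that applies both renames in a single pass. It matches any identifier ending in the old names, which is an assumption about how the calls appear in user code.

```python
import re

# Old name -> new name, taken from the rename entry in this release.
RENAMES = {
    "tile_allreduce": "tile_reduce",
    "tile_reduce": "tile_rooted_reduce",
}

# Match either old name up to a word boundary; one pass avoids renaming twice.
_PATTERN = re.compile(r"tile_(?:allreduce|reduce)\b")

def migrate_tile_names(source):
    """Rewrite pre-3.5.19 tile collective names to their 3.5.19 names."""
    return _PATTERN.sub(lambda m: RENAMES[m.group(0)], source)

print(migrate_tile_names("tile_allreduce(a); tile_reduce(b);"))
```

The helper should be applied only once to a given source file: running it again would treat the new ``tile_reduce`` as the old rooted variant.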

NVSHMEM4Py
# Exposed the following APIs to Python: device collectives, AMO and signaling, external buffer registration, arbitrary team creation, multicast buffers, and signal_op for signaling without requiring data movement
# Added ability to allocate Fortran-memory-ordered arrays and tensors
# Added CUDA-related search paths and updated dependency pinning to enable CUDA 13.0
# Added flag to disable NVSHMEM4Py device build
# Added cumodule init/finalize core API support
# Added safety checks around multicast/peer buffer creation
# Added ``multicast_array``/``peer_array`` implementations and enabled ``bfloat16`` for Numba DSL
# Added RMA high-level bindings for ``nvshmem.core.device.numba``
# Added Numbast to generate NVSHMEM CUDA device bindings
# Added parameter ``except_on_del`` to allocation functions to raise an exception when a buffer leak is discovered or an already freed buffer is used
# Added missing documentation to high level bindings for nvshmem APIs
# Changed to use cuda-core pathfinder instead of relying on ``LD_LIBRARY_PATH``
# Updated `requirements.txt` to pull in the correct versions of `nvcc` and `nvjitlink` which do not interfere with Torch
# Fixed several minor bugs

===============================================================================
                               Changes in 3.4.5
===============================================================================

# Added support for data direct NIC configurations in the IB transports.
# Added a new environment variable, `NVSHMEM_DISABLE_DATA_DIRECT`, to force disable data direct NIC even when present.
# Added support for CPU-Assisted IBGDA without the use of GDRCopy or the x86 regkey.
Systems not supporting the other methods will automatically fall back to this new method.
It enables the use of IBGDA on a broad range of systems without the need for administrator intervention.
# Added a new environment variable `NVSHMEM_HCA_PREFIX` to enable IB transports on systems which
name their HCA devices in a non-standard way (for example, `ipb*` instead of `mlx5*`).
# Deprecated support for the combined `libnvshmem.a` host and device static library. See the Release Notes for more details.
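
As a rough picture of what a device-name prefix such as `NVSHMEM_HCA_PREFIX` does, the sketch below filters a list of HCA names. The semantics are assumed, not confirmed by this changelog: the spec is treated as a comma-separated list of prefixes, and a leading `^` (accepted from 3.5.19 onward) is treated as an exclusion marker, by analogy with similar variables in other libraries. `filter_hcas` is a hypothetical helper.

```python
def filter_hcas(devices, prefix_spec):
    """Return devices whose names match the prefix spec.

    Assumed semantics (not confirmed by the changelog): the spec is a
    comma-separated list of name prefixes, and a leading '^' inverts the match.
    """
    exclude = prefix_spec.startswith("^")
    prefixes = tuple(p for p in prefix_spec.lstrip("^").split(",") if p)
    if not prefixes:
        return list(devices)
    matched = [d for d in devices if d.startswith(prefixes)]
    if exclude:
        return [d for d in devices if d not in matched]
    return matched

print(filter_hcas(["mlx5_0", "mlx5_1", "ipb0"], "ipb"))
```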

===============================================================================
                               Changes in 3.3.9
===============================================================================

# Enabled GA platform support for Blackwell B200/GB200NVL72-based systems. Additionally, enabled SASS support for Ada architecture.
# Added support for official python language bindings (`nvshmem4py`) enabling symmetric memory management, on-stream RMA and collective APIs
to aid in development of custom kernels using symmetric memory and enable fine-grained communication in native Python.
The ``nvshmem4py`` package is available via PyPi wheels/conda installers.
# Added support for fast-kernels centric (cuTLASS) tile-granular NVLS device-sided collectives to aid
development of fused distributed GeMM kernels
# Added support for flexible team initialization API (``nvshmemx_team_init``) using an arbitrary set of PEs to enable non-linear,
non-contiguous PE indexing, if desired.
# Added support for symmetric user-buffer registration (``nvshmemx_buffer_register_symmetric``) to enable ML
frameworks to "bring-your-own-buffer" (BYOB) for zero-copy communication kernels.
# Added support for narrow-type (``float16``, ``bfloat16``) precision for the NVLS ``reducescatter`` collective
and the LL8 `fcollect` algorithm for low-latency collectives.
# Added support for device-side ``nvshmem_broadcastmem``, ``nvshmem_fcollectmem`` APIs in the library.
# Added support for CUDA module-independent loading using ``nvshmemx_culibrary_init``.
# Added support for leveraging multiple Queue-Pairs (QPs) on LAG bonded NICs for RDMA transports. Users can use
``NVSHMEM_IB_NUM_RC_PER_DEVICE`` environment variable to tune this value, as desired.
# Added support to randomize QP assignment for multiple GPU endpoints when communicating over IBGDA transport.
# Added CUDA graph capture capabilities to on-stream collectives performance benchmarks using `--cudagraph` command-line
parameter.
# Enabled host-side clang compilation support for NVSHMEM host library.
# Improved GPU thread-occupancy for on-stream ``fcollect`` when utilizing NVLS and LL algorithms by 30%.
# Improved multi-SM NVLS on-stream collectives to utilize CTA adaptively as a function of NVLINK domain size.
# Improved runtime-detection of CUDA VMM support and fallback to legacy pinned memory allocation `cudaMalloc`,
when platform support is not available for VMM.
# Improved dynamic GID detection for RoCE transports in containerized environment, when querying the GID from sysfs could throw errors.
# Improved perftest presentation layer to provide additional ``count`` column capturing total number of elements per operation,
independent of datatype size.
# Improved pt-to-pt signaling latency to always leverage CE-centric APIs `cuStreamWriteValue/cuStreamWaitValue` by 20%.
# Fixed a bug in perftest reporting when both datatype and reduceop are specified.
# Fixed an application crash when ``nvshmemx_fcollect_on_stream`` attempts to use more CTAs than available NVSHMEM teams.
# Fixed an application crash when NVSHMEM remote transports attempt to use more than 16 HCAs per node.
# Fixed an application crash in the ``nvshmemx_mc_ptr`` API that is caused when executed on a platform without NVLS support.
# Fixed a compile-time error with LLVM IR bitcode device library when compiling with clang-llvm > 19.
# Fixed a compile-time error with IBGDA support when built without GDRCopy support.
# Fixed a compile-time error with ``moe_shuffle.cu`` due to a missing `getopt` header.
# Fixed a data corruption bug in device-side pt-to-pt get/put bandwidth test due to missing usage of non-symmetric
memory buffers for bandwidth summarization.
# Fixed a host clang compilation bug due to missing `__CUDA_ARCH__` conditional check for non-CUDA device inline
assembly codepath.
# Fixed a bug in the symmetric memory management layer that was caused due to missing override for ``NVSHMEM_CUMEM_GRANULARITY``
for static device memory heaps (`cudaMalloc`).
# Fixed a data corruption bug in on-stream NVLS 2-shot allreduce, 1-shot allgather collective that was caused
due to missing memory fence to order data and barrier.
# Fixed a stale sysfs filepath used for nvidia-peermem discovery that caused regression when running with
CUDA 12.8 or above
# Fixed a hang in MPI bootstrap allgather collective that was caused by incorrect usage of MPI_IN_PLACE
operations.
# Fixed a data correctness bug in the LLVM IR bitcode device library that caused incorrect results
for 16-byte aligned put/get operations which had a size that was not a multiple of 16 bytes.
See https://forums.developer.nvidia.com/t/bug-with-nvshmem-3-2-5-for-bitcode-compiling/327847
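
The last fix above concerns transfers whose buffers are 16-byte aligned but whose size is not a multiple of 16 bytes. The changelog does not describe the root cause. The sketch below only illustrates that class of case: a copy done in full 16-byte units must still handle the leftover tail bytes. The function names are hypothetical.

```python
def plan_copy(nbytes, vec_bytes=16):
    """Split a transfer into full vector-width accesses plus a byte-wise tail."""
    if nbytes < 0:
        raise ValueError("nbytes must be non-negative")
    full_vectors, tail_bytes = divmod(nbytes, vec_bytes)
    return full_vectors, tail_bytes

def copy_with_tail(src, vec_bytes=16):
    """Copy a bytes-like object in vector-sized chunks, then copy the remainder."""
    full_vectors, tail_bytes = plan_copy(len(src), vec_bytes)
    dst = bytearray(len(src))
    for i in range(full_vectors):
        lo = i * vec_bytes
        dst[lo:lo + vec_bytes] = src[lo:lo + vec_bytes]
    tail_start = full_vectors * vec_bytes
    dst[tail_start:tail_start + tail_bytes] = src[tail_start:tail_start + tail_bytes]
    return bytes(dst)

print(plan_copy(40))  # two full 16-byte units and an 8-byte tail
```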

===============================================================================
                               Changes in 3.2.0
===============================================================================

# Enable experimental platform support for Blackwell B200-based systems.
# Added one-shot and two-shot NVLINK SHARP (NVLS) allreduce algorithms for
half-precision (``float16``, ``bfloat16``) and full-precision (``float32``) datatypes on
NVLINK4 and NVLINK5 enabled platforms.
# Added multi-SM based acceleration of TP collectives (`reduce`, `fcollect`,
`reducescatter`) to improve NVLINK BW utilization on NVLINK4-enabled platform to
achieve 8x/16x speedup for medium to large-message size (>=1MB).
# NVSHMEM now also ships LLVM IR bitcode device library to support MLIR-compliant compiler
toolchain integration on new and upcoming Python DSLs (Triton, Mosaic, Numba, and so on).
This feature enhances perftest to support cubin-based ``cudaCooperativeLaunch`` and
kernel function-based ``nvshmemx_collective_launch`` execution to improve robustness
of the new bitcode device library.
# Enhanced NVSHMEM host/device side collective and pt-to-pt to use new command-line
interface to support the runtime tunability of message size, datatype, reduce op, iterations, and so on.
# Improved heuristics for the automatic selection of on-stream NVLS
collectives for allgather, reducescatter, and allreduce operations that span
NVLINK-connected, GPU-based systems.
# Eliminates dynamic link-time dependency on MPI and SHMEM on perftest and examples
and replaces them with the dynamic load-time capability in the perftest and examples.
# Fixed a bug that was related to incorrect bus bandwidth reporting in
``shmem_p_bw``, ``shmem_g_bw``, ``shmem_atomic_bw``, ``shmem_put_bw``, and ``shmem_get_bw`` perftests.
# Fixed a bug that was related to rounding error in NVLS reducescatter min and
max operation due to incorrect usage of vectorized ``float16`` instead of ``uint32`` datatypes.
# Fixed a bug that was related to dynamic loading of an unversioned bootstrap library.
# Fixed a bug that was related to linking CMake projects to system installer packages.
# Fixed a bug that was related to building a heterogeneous version of the device library.
# Fixed a bug that was related to establishing QP connection in IBGDA transport
when using Dynamic Connection (DC) mode.
# Fixed a bug that was related to building perftests for earlier CUDA versions
(for example, 11.8) that do not support half-precision datatypes (for example, ``__nv_bfloat16``).
# Fixed a bug that was related to ABI compatibility breakage for allreduce maxloc op.
# Fixed a bug that was related to non-deterministic deadlock/race condition on the GPU when mixing
``nvshmemx_team_split_strided`` with ``nvshmemx_barrier_all_on_stream`` operation back-to-back.
# Fixed a bug that was related to out-of-memory (OOM) during dynamic device memory
based symmetric heap reservation on platforms with > 8 NVLINK connected GPUs.
# Fixed a documentation bug that was related to incorrect usage of
``MPI_Bcast`` and unversioned ``nvshmemx_init_attr_t`` structure when initializing NVSHMEM using unique ID.
# Fixed a bug that was related to host memory corruption/free when creating
multiple teams using ``nvshmem_team_split_strided``.
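
Several entries in this changelog concern strided team splitting. As a reminder of the indexing involved, the sketch below models the (start, stride, size) triplet used by OpenSHMEM-style strided splits: member `i` of the new team is parent PE `start + i * stride`. The helper names are hypothetical and the code does not call NVSHMEM.

```python
def strided_team_members(start, stride, size, parent_npes):
    """List parent-team PE indices selected by a (start, stride, size) triplet."""
    if size < 1:
        raise ValueError("size must be at least 1")
    members = [start + i * stride for i in range(size)]
    if any(pe < 0 or pe >= parent_npes for pe in members):
        raise ValueError("triplet selects a PE outside the parent team")
    return members

def translate_to_team(parent_pe, start, stride, size):
    """Return the PE's index inside the new team, or -1 if it is not a member."""
    if stride == 0:
        return 0 if parent_pe == start else -1
    offset = parent_pe - start
    if offset % stride != 0:
        return -1
    idx = offset // stride
    return idx if 0 <= idx < size else -1

print(strided_team_members(1, 2, 4, 8))
```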

===============================================================================
				Changes in 3.1.0
===============================================================================

# Added support for NVLINK SHARP (NVLS) based collective algorithms on x86 + Hopper and
Grace Hopper architecture based single and multi-node NVLINK platforms for
popular deep-learning collective communications (ReduceScatter, Allgather, Allreduce)
device and on-stream APIs. This feature improves latency for small-message
size by 2-3x speedup, when compared with one-shot algorithms over
NVLINK.
# Added support for GPU kernels that wish to utilize a low-level query API to
NVLS enabled symmetric memory using `nvshmemx_mc_ptr` host and device API
for a given target `team`.
# Added support for new Low-Latency protocol (LL128) for Allgather collective
communication device and on-stream APIs.
# Enhanced support for existing low-latency protocol (LL) warp-scoped collective
to provide a 2x speedup over traditional algorithms when scaling up the number of GPUs up to 32.
# Added support for half-precision (FP16/BF16) format on collective
communication (ReduceScatter, Allgather, Allreduce) on-device and on-stream
APIs.
# Added support for Python wheels via PyPI repository and rpm/deb package
distribution.
# Added support for dynamic RDMA Global Identifier (GID) discovery for RoCE
transports. This feature enables automatic fallback to the discovered GID
without requiring the user to specify the GID via runtime variable.
# Added support for a heterogeneous library build system. This feature allows
the NVSHMEM static library to be built with a separate CUDA version from the NVSHMEM host library.
This enables new features such as NVLS in the host library while still
allowing applications compiled against lower versions of CUDA to link to the
NVSHMEM device library, making the entire library portable to different CUDA
minor versions while remaining feature complete. Users can specify a distinct
CUDA version for the device library by specifying
``NVSHMEM_DEVICELIB_CUDA_HOME=<PATH TO CUDA>``. Otherwise the host CUDA version will be used.
# Enhance support for NVSHMEM on_stream signal APIs to use
`cuStreamWriteValue()` over P2P connected GPUs when possible. This makes it possible
to have a zero-SM implementation of the on_stream signalling op when possible.
# Added support for DMABuf based registration of NIC control-structures in
IBGDA to leverage DMABuf mainline support in newer linux kernels (over
proprietary solution nvidia-peermem).
# Added a sample code for NVSHMEM UniqueID (UID) socket based
bootstrap modality under `examples` directory.
# Added support for NVSHMEM performance benchmarks to our release binary
packages.
# Removed host API based nvshmem collectives performance benchmarks.
# Enhanced collectives performance by adding new metrics - Algorithmic Bandwidth
(algoBW) and Bus Bandwidth (BusBW) to NVSHMEM performance benchmarks.
# Fixed support for Ninja build generator in our CMake build system.
# Fixed a runtime bug related to use of ``NVSHMEM_DEVICE_TIMEOUT_POLLLING`` build time variable.
# Enhanced our CI pipelines to support job-specific timeout to force
early termination of any job that is hung on the GPU or CPU and avoid pipeline
starvation of subsequently queued jobs on the same system.
# Fixed a performance bug in on-stream collectives perftest related to use of
cudaMemcpyAsync on the same CUDA stream, where cudaEvent for profiling start &
end time of the on-stream communication kernel are submitted
# Fixed a bug related to virtual member functions of
`nvshmemi_symmetric_heap` by forcing its access specifier to be protected to
limit its access to only inherited child classes
# Fixed a bug related to recursive destructor memory corruption and
`nullptr` access to static member function of `nvshmemi_mem_transport`
class.
# Fixed a bug with incorrect compile-time value for
``NVML_GPU_FABRIC_STATE_COMPLETED`` and ``NVML_GPU_FABRIC_UUID_LEN``
constants.
# Fixed a bug in ``nvshmemx_collective_launch_query_gridsize`` which could
cause it to erroneously return a gridsize of 0.
# Fixed a bug during ``nvshmem_init`` which could cause the application to crash in MNNVL discovery
when used with CUDA compat libraries at runtime for CUDA toolkit > 12.4.
# Fixed a bug in ``nvshmemx_collective_launch`` which could cause duplicate initialization of
nvshmem device state.
# Fixed a bug related to uninitialized variables in IBGDA device code.
# Fixed a bug related to out-of-bound access (OOB) in atomic BW performance
test.
# Fixed a bug related to missing C/C++ `stdint` headers on Ubuntu24.04 + x86
based systems.
# Fixed a bug related to incorrect calculation of team specific stride when
creating a new team using `nvshmem_team_split_strided`.
# Enhanced the reduce-based collective symmetric memory scratch space to
512KB to accommodate additional space for reducescatter based collectives.
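
The Algorithmic Bandwidth (algoBW) and Bus Bandwidth (BusBW) metrics mentioned above can be related with a short calculation. This changelog does not give the formulas. The correction factors below follow the convention commonly used by collective benchmarks such as nccl-tests, and it is an assumption that the NVSHMEM perftests use the same factors.

```python
def algo_bw_gbps(message_bytes, seconds):
    """Algorithmic bandwidth: bytes in the operation divided by elapsed time, in GB/s."""
    return message_bytes / seconds / 1e9

# Correction factors in the nccl-tests convention; assumed, not taken from NVSHMEM sources.
_BUS_FACTORS = {
    "allreduce": lambda n: 2.0 * (n - 1) / n,
    "allgather": lambda n: (n - 1) / n,
    "reducescatter": lambda n: (n - 1) / n,
    "broadcast": lambda n: 1.0,
}

def bus_bw_gbps(collective, message_bytes, seconds, npes):
    """Bus bandwidth: algorithmic bandwidth scaled by a per-collective factor."""
    return algo_bw_gbps(message_bytes, seconds) * _BUS_FACTORS[collective](npes)

print(bus_bw_gbps("allreduce", 1e9, 1.0, 8))
```

The scaling is meant to make results comparable across PE counts, since the algorithmic figure alone depends on how many PEs take part.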

===============================================================================
				Changes in 3.0.6
===============================================================================

# Added support for multi-node systems that have both RDMA networks
(IB, RoCE, Slingshot, etc.) as well as NVLink as multi-node interconnects.
# Added support for ABI backward compatibility between host and device libraries.
Within the same NVSHMEM major version, newer host library will continue to be
compatible with an older device library version. The work involved minimizing
ABI surface between host and device libraries and versioning of structs and
functions that are part of the new ABI surface.
# Enhance NVSHMEM's memory management infrastructure using object oriented
programming (OOP) framework with multi-level inheritance to manage support for
various memory types and to enable support for newer memory types in the future.
# Added support for PTX testing in NVSHMEM.
# Added support for CPU assisted IBGDA via the NIC handler to manage NIC doorbell.
The NIC handler can now be selected through the new environment variable -
`NVSHMEM_IBGDA_NIC_HANDLER`. This feature would enable IBGDA adoption on systems
that don't have `PeerMappingOverride=1` driver setting.
# Improved performance of IBGDA transport initialization by 20-50% when scaling up
the number of PEs, by batching and minimizing the number of memory registration
invocations for IB control structures.
# Enhance support for composing NVSHMEM_TEAM_SHARED on Multi-node NVLink (MNNVL)
based systems.
# Improved performance for block scoped reductions by parallelizing send/recv data,
when sending small size messages. Also, NVSHMEM device code compiled with CUDA 11.0
and std=c++17 will automatically make use of cooperative group reduction APIs to
improve performance of local reductions.
# Fixed implementation of system scoped atomic memory operations (AMO)
such as `nvshmem_fence/atomic_<ops>` and signaled operations `nvshmem_signal_<op>`
when communicating over NVLink.
# Added IBGDA support to automatically prefer RC over DC connected QPs and update
the default values of `NVSHMEM_IBGDA_NUM_RC_PER_PE/NVSHMEM_IBGDA_NUM_DCI` to be 1.
# Added assertions in DEVX and IBGDA transport for checking extended atomics
support in the RDMA NICs.
# Added support for no-collective synchronization action in
`nvshmem_malloc/calloc/align/free`, to follow OpenSHMEM spec compliant behavior,
when requested size or buffer in heap is 0 and NULL respectively.
# Added support for `nvshmemx_fcollectmem/broadcastmem` device and onstream APIs
# Improved performance tracing for on-stream and host collectives performance
benchmarks using `cudaEventElapsedTime` instead of `gettimeofday` API.
# Added support for performance benchmark `bootstrap_coll` for various bootstrap
modalities in NVSHMEM.
# Added support for "Include-What-You-Use" (IWYU) framework in CMake build system.
# Removed support for deprecated Power-9 systems.
# Removed support for deprecated makefile build system. NVSHMEM now supports the CMake
build system exclusively.
# Fixed a bug in remote transports during memory registration and deregistration,
with respect to memory handle management cache.
# Fixed a bug in QP mapping options `NVSHMEM_IBGDA_DCI_MAP_BY=warp` or
`NVSHMEM_IBGDA_RC_MAP_BY=warp`, which previously led to suboptimal mapping of
QPs to warps/DCTs.
# Fixed a bug to dynamically load explicitly versioned `libcuda.so` and `libnvml.so`.
# Fixed a bug in computing NVSHMEM team symmetric heap memory requirements during
runtime initialization.
# Fixed a bug related to stale filepaths when aborting a NVSHMEM runtime.
# Fixed a bug when building NVSHMEM remote transports with
`HAVE_IBV_ACCESS_RELAXED_ORDERING` set.
# Fixed a bug that exhibits the behavior of a GPU device hang, when using RC QP
type with IBGDA.
# Fixed a bug with an incorrect value of broadcast LL threshold.
# Fixed a bug in IBDEVX related to incorrect endianness check.
# Fixed a memory leak in `nvshmem_team_destroy` related to missing teardown for
two internal subteams for each user created team.
# Fixed several minor bugs and memory leaks.
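
The host/device ABI rule stated above (within one major version, a newer host library stays compatible with an older device library) can be written as a small predicate. This is a sketch of the rule as worded in the entry, not the runtime's actual check. The function names and the `major.minor.patch` string format are assumptions.

```python
def parse_version(text):
    """Turn 'major.minor.patch' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))

def host_device_compatible(host_version, device_version):
    """Same major version, and the host library is not older than the device library."""
    host, device = parse_version(host_version), parse_version(device_version)
    return host[0] == device[0] and host >= device

print(host_device_compatible("3.1.0", "3.0.6"))
```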

===============================================================================
				Changes in 2.11.0
===============================================================================

# Added support for Multi-node NVLink (MNNVL) systems when all nodes are
connected via NVLink
# Added support for multiple NICs per PE in IBGDA transport. It can be enabled
using NVSHMEM_IBGDA_ENABLE_MULTI_PORT runtime environment variable.
# Added support for sockets-based bootstrapping of NVSHMEM jobs through the Unique ID based initialization API
# Added nvshmemx_hostlib_init API that allows NVSHMEM host library only initialization.
This is useful for applications that only use NVSHMEM host API and need not
statically link NVSHMEM device library.
# Added support for dynamically linking NVSHMEM library through dlopen()
# Introduces a new nvshmemx_vendor_get_version_info API to query the
NVSHMEM_VENDOR_MAJOR_VERSION, NVSHMEM_VENDOR_MINOR_VERSION, NVSHMEM_VENDOR_PATCH_VERSION
for API consumers
# Added NVSHMEM_IGNORE_CUDA_MPS_ACTIVE_THREAD_PERCENTAGE runtime environment variable
to get the full API support with Multi-Process per GPU (MPG) runs even if
CUDA_MPS_ACTIVE_THREAD_PERCENTAGE is not set to 1/PEs.
# Improved throughput and bandwidth performance of IBGDA transport
# Fixed hang that was introduced in CUDA VMM path on DGX1V systems in NVSHMEM 2.10.1
# Improved performance of nvshmemx_quiet_on_stream() API with IBGDA transport
by leveraging multiple CUDA threads to perform IBGDA quiet operation
# Fixed hang with minimal proxy service in nvshmem_global_exit on Grace Hopper system due to memory reordering
of load/stores
# Enable relaxed ordering by default for InfiniBand transports. Added runtime
environment variable NVSHMEM_IB_ENABLE_RELAXED_ORDERING to disable it.
# Increased number of threads launched to execute nvshmemx_<typename>_<op>_reduce_on_stream() API
# Added runtime environment variable NVSHMEM_DISABLE_DMABUF to disable use of dmabuf
# Fixed an issue in IBGDA transport when doing very large message transfers beyond the maximum
size supported by a single NIC work request
# Fixed several minor bugs and memory leaks
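
The Multi-Process per GPU (MPG) entry above refers to CUDA_MPS_ACTIVE_THREAD_PERCENTAGE being set to "1/PEs". The sketch below expresses that as a percentage and models the override variable as a boolean. The decision rule (each PE's percentage must not exceed its equal share) is an assumption made for illustration, and the helper names are hypothetical.

```python
def expected_percentage_per_pe(pes_per_gpu):
    """Value the entry describes as '1/PEs', expressed as a percentage."""
    if pes_per_gpu < 1:
        raise ValueError("pes_per_gpu must be at least 1")
    return 100.0 / pes_per_gpu

def full_api_support(pes_per_gpu, active_thread_percentage, ignore_check=False):
    """Decide whether the full API would be offered under the assumed rule.

    ignore_check stands in for NVSHMEM_IGNORE_CUDA_MPS_ACTIVE_THREAD_PERCENTAGE.
    active_thread_percentage is None when CUDA_MPS_ACTIVE_THREAD_PERCENTAGE is unset.
    """
    if ignore_check:
        return True
    if active_thread_percentage is None:
        return False
    return active_thread_percentage <= expected_percentage_per_pe(pes_per_gpu)

print(expected_percentage_per_pe(4))
```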

===============================================================================
				Changes in 2.10.1
===============================================================================

# Support for single and multi-node Grace Hopper systems
# Support for the EFA provider using the libfabric transport, which can be
enabled with NVSHMEM_LIBFABRIC_PERSONA=EFA
# NVRTC support was added for the NVSHMEM device implementation headers.
# Fixed memory leaks in nvshmem_finalize
# Added support for calling nvshmem_init and nvshmem_finalize in a loop with
any bootstrap. Previously the support had existed only for MPI bootstrap
# Performance optimizations in Alltoall collective API
# Implemented warp-level automated coalescing of nvshmem_<typename>_g
operations to contiguous addresses in IBGDA transport
# Removed redundant consistency operations in IBGDA transport
# Added support for synchronized memory operations when using VMM API for NVSHMEM symmetric heap
# Code refactoring to improve host and device library ABI interface
# Several bug fixes

===============================================================================
				Changes in 2.9.0
===============================================================================

# Improvements to CMake build system. CMake is now the default build system and
the Makefile build system is deprecated.
# Added loadable network transport modules.
# NVSHMEM device code can now be inlined to improve performance by enabling
NVSHMEM_ENABLE_ALL_DEVICE_INLINING when building the NVSHMEM library.
# Improvements to collective communication performance.
# Updated libfabric transport to fragment messages larger than the maximum
length supported by the provider.
# Improvements to IBGDA transport, including large message support, user buffer
registration, blocking g/get/amo performance, CUDA module support, and several
bugfixes.
# Introduced ABI compatibility for bootstrap modules. This release is
backwards compatible with the ABI introduced in NVSHMEM 2.8.0.
# Added NVSHMEM_BOOTSTRAP_*_PLUGIN environment variables that can be used to
override the default filename used when opening each bootstrap plugin.
# Improved error handling for GDRCopy.
# Added a check to detect when the same number of PEs is not run on all nodes.
# Added a check to detect availability of nvidia_peermem kernel module.
# Reduced internal stream synchronizations to fix a compatibility bug with CUDA
graph capture.
# Fixed a data consistency issue with CUDA graph capture support.
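
The libfabric change above, fragmenting messages larger than the provider's maximum length, amounts to splitting one request into bounded pieces. A minimal sketch of that idea follows. `fragment` is a hypothetical helper and the maximum length is a plain parameter here, not a value queried from a provider.

```python
def fragment(offset, length, max_len):
    """Yield (offset, length) pieces no larger than max_len that cover the request."""
    if max_len < 1:
        raise ValueError("max_len must be positive")
    end = offset + length
    while offset < end:
        piece = min(max_len, end - offset)
        yield offset, piece
        offset += piece

print(list(fragment(0, 10, 4)))
```

Each piece would then be issued as its own transfer, with completion of the whole message tracked across all pieces.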

===============================================================================
				Changes in 2.8.0
===============================================================================

# The transport formerly called GPU Initiated Communication (GIC) has been
renamed to InfiniBand GPUDirect Async (IBGDA) to reflect the underlying
technology used by that transport.
# Improvements to the all-to-all algorithm were made for both the IBGDA and
IBRC transports. These changes specifically focused on latency bound all-to-all
operations.
# Support for RC connections was added to IBGDA to optimize workloads on small
PE sets.
# Fixed an issue in the IBGDA Transport which caused all GPUs on the same host
to use the same NIC.
# Fixed an issue in the DMA-BUF registration path. Users no longer need to
limit their allocation granularity to 4GiB when using DMABUF.

===============================================================================
				Changes in 2.7.0
===============================================================================

# Added experimental CMake build system that will replace the Makefile in a
future release
# Updated GPU Initiated Communication (GIC) transport provides significant
performance improvements over NVSHMEM 2.6.0
# Added NVSHMEM version checks to ensure that the dynamically linked NVSHMEM
host library is compatible with the statically linked device library. Also
added compatibility checks for the inbuilt bootstrap plugins.
# Added support for CUDA minor version compatibility, which allows NVSHMEM
application binaries built with CUDA M.X to run with M.Y, where M is the
major version and X and Y are compatible minor versions
# NVSHMEM library now statically links libcudart_static.a and dlopens libcuda.so
# Improved timing in NVSHMEM performance tests to reduce noise in measurements
# Added support for Hopper compute_90 and sm_90
# Removed support for Pascal compute_60, sm_60, compute_61, and sm_61
# Added version number suffix to libnvshmem_host.so and bootstrap plugins
# Added support for dmabuf memory registration
# Updated Hydra installation script to install Hydra 4.0.2
# Added a pre-built Hydra launcher to NVSHMEM binary packages.
# Added detection of user buffer registration errors when the requested buffer
overlaps with an already registered memory region
# Fixed an issue that caused validation errors in collective operations when
all GPUs in a job are connected via PCIe without a remote transport using the
proxy thread.


===============================================================================
				Changes in 2.6.0
===============================================================================

# Added new GPU initiated communication transport that allows kernel initiated
communication to be issued directly to the NIC and bypass the CPU proxy thread.
The transport is currently experimental and is disabled by default.
Refer to the installation guide for how to enable it.
# Updated the libfabric transport with initial support for Slingshot-11 networks.
Performance tuning for the libfabric transport is ongoing.
# Added collective algorithms for bcast/fcollect/reduce that use low latency (LL)
optimization by sending data and synchronization together, resulting in
significant performance improvements.
# Added warp- and block-scope implementation of recursive exchange algorithm for
reduce collectives
# Fixed bug in host/on-stream RMA API for very large data transfers
# Fixed bug in implementation of nvshmem_fence and nvshmemx_quiet_on_stream API

===============================================================================
				Changes in 2.5.0
===============================================================================

# Added multi-instance support in NVSHMEM. NVSHMEM now builds as two libraries,
libnvshmem_host.so and libnvshmem_device.a, making it possible for an
application to have multiple components (for example, shared libraries and the
application itself) that use NVSHMEM. Support for the single library,
libnvshmem.a, still exists for legacy purposes but will eventually be removed.
# Added nvshmemx_init_status API to query the initialized state of NVSHMEM
# Added experimental DevX transport that directly uses Mellanox software stack
for InfiniBand devices
# Added experimental libfabric transport that will be used to support Slingshot
networks in a future release
# Added support for CUDA_VISIBLE_DEVICES. Support for CUDA_VISIBLE_DEVICES is
not yet available with CUDA VMM and requires setting NVSHMEM_DISABLE_CUDA_VMM=1.
# Updated PMI and PMI-2 bootstraps to plugins
# Added nvshmem-info utility to display information about the NVSHMEM library
# Fixed warnings when using NVSHMEM in applications compiled without RDC
(Relocatable Device Code) option
# Renamed internal variables to avoid potential conflicts with variables in
application
# Implemented nvshmem_alltoallmem API
# Improved GPU to NIC assignment logic for Summit/Sierra supercomputers
# Fixed host barrier API implementation for non-blocking on stream (*_nbi_on_stream)
point-to-point operations
# Updated descriptions for NVSHMEM environment variables displayed via
nvshmem-info or by setting NVSHMEM_INFO=1
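
As an illustration of the CUDA_VISIBLE_DEVICES and NVSHMEM_INFO items above, the
following shell fragment is a minimal sketch of an environment setup. The GPU
indices are arbitrary example values, and the launch step is left as a comment
because the launcher and application name depend on the installation.

```shell
# Expose only two GPUs to the process (the indices are example values).
export CUDA_VISIBLE_DEVICES=0,1

# CUDA_VISIBLE_DEVICES is not yet supported together with CUDA VMM,
# so VMM has to be disabled explicitly.
export NVSHMEM_DISABLE_CUDA_VMM=1

# Request the descriptions of the NVSHMEM environment variables.
export NVSHMEM_INFO=1

# Launch the application from this shell with the site's launcher.
# Hypothetical example, not a real command line:
#   <launcher> ./my_nvshmem_app
```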

===============================================================================
				Changes in 2.4.1
===============================================================================

# Added limited support for Multiple Processes per GPU (MPG) on x86 platforms.
The amount of support depends on availability of CUDA MPS. MPG support is
currently not available on P9 platforms.
# Added a local buffer registration API that allows non-symmetric buffers to be
used as local buffers in NVSHMEM API.
# Added support for dynamic symmetric heap allocation, which eliminates the need
to specify NVSHMEM_SYMMETRIC_SIZE.
This feature is available with CUDA >= 11.3 and is enabled by default on x86
platforms. On P9 platforms, it is disabled by default and can be enabled using
the NVSHMEM_DISABLE_CUDA_VMM environment variable.
# Added support for very large RMA messages
# NVSHMEM can now be built without ibrc support by setting NVSHMEM_IBRC_SUPPORT=0
in the environment before building.
This allows users to build and run NVSHMEM without the GDRCopy and OFED dependencies.
# Added support for calling nvshmem_init/finalize multiple times with MPI bootstrap
# Improved testing coverage (large messages, exercising full GPU memory, and so on)
# Improved the default PE to NIC assignment for DGX2 systems
# Optimized channel request processing by CPU proxy thread
# Added support for the shmem_global_exit API
# Removed redundant barriers to improve the collectives’ performance
# Significant code refactoring to use templates instead of macros for internal
functions
# Improved performance for device-side blocking RMA and strided RMA API
# Fixed a bug affecting buffers with large offsets into the NVSHMEM symmetric heap
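
The IBRC build option above is read from the environment at build time. A
minimal sketch, assuming a POSIX shell and a Makefile-based build; the build
command is shown only as a comment because it depends on the local setup.

```shell
# Build NVSHMEM without IBRC support, which removes the GDRCopy and OFED
# dependencies from the build.
export NVSHMEM_IBRC_SUPPORT=0

# Run the usual NVSHMEM build from this same shell so that it sees the
# setting. Assumed invocation, adjust to the local build procedure:
#   make
```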

===============================================================================
				Changes in 2.2.1
===============================================================================

# Implemented dynamic heap memory allocation (requires CUDA version >= 11.3) for
runs with P2P GPUs. It can be enabled using NVSHMEM_DISABLE_CUDA_VMM=0. Support
for IB runs will be added in the next release.
# Improved UCX transport performance for AMO and RMA operations
# Improved performance for warp and block put/get operations
# Added atomic support for PCIe connected GPUs over the UCX transport
# UCX transport now supports non-symmetric buffers for use as local buffers
in RMA and AMO operations
# Added support for initializing NVSHMEM in CUmodule
# Enabled MPI and PMIx bootstrap modules to be compiled externally from the
NVSHMEM build. This allows multiple builds of these plugins to support various
MPI and PMIx libraries. They can be selected by setting NVSHMEM_BOOTSTRAP="plugin"
and NVSHMEM_BOOTSTRAP_PLUGIN="plugin_name.so". Plugin sources are installed along
with the compiled NVSHMEM library.
# Enabled MPI bootstrap to be used with nvshmem_init by setting
NVSHMEM_BOOTSTRAP=MPI or via the bootstrap plugin method.
# Fixed bugs in nvshmem_<typename>_g and fetch atomics implementation
# Changed nvshmem_<typename>_collect to nvshmem_<typename>_fcollect to match
OpenSHMEM specification
# Fixed type of nreduce argument in reduction API to size_t to match OpenSHMEM
specification
# Improved NVSHMEM build times with multi-threaded option in CUDA compiler
(requires CUDA version >= 11.2)
# Several fixes to address Coverity reports
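
The bootstrap plugin and dynamic heap settings above can be combined in the
job environment. The fragment below is a minimal sketch, assuming a POSIX
shell; "plugin_name.so" is the placeholder used in the note above, not the name
of a real plugin file.

```shell
# Select the plugin bootstrap and name the plugin library to load.
# Substitute the file name of the externally built MPI or PMIx plugin.
export NVSHMEM_BOOTSTRAP="plugin"
export NVSHMEM_BOOTSTRAP_PLUGIN="plugin_name.so"

# Alternative for applications that call nvshmem_init and want the MPI
# bootstrap directly:
#   export NVSHMEM_BOOTSTRAP=MPI

# Enable dynamic heap memory allocation for runs with P2P GPUs
# (requires CUDA version >= 11.3).
export NVSHMEM_DISABLE_CUDA_VMM=0
```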

===============================================================================
				Changes in 2.1.2
===============================================================================

# Added a new, experimental UCX internode communication transport layer
# Added support for automatic warp-level coalescing of nvshmem_g operations
# Added support for put-with-signal operations on CUDA streams
# Added support for mapping the symmetric heap using the cuMem APIs
# Improved performance of single-threaded NVSHMEM put/get device API
# Added the NVSHMEM_MAX_TEAMS environment variable to specify maximum number
of teams that can be created
# Improved the host and on-stream Alltoall performance by using NCCL
# Fixed a bug in the compare-and-swap operation that caused several bytes of the
compare operand to be lost
# Added CPU core affinity to debugging output
# Added support for the CUDA 11.3 cudaDeviceFlushGPUDirectRDMAWrites API for consistency
# Improved support for the NVIDIA Tools Extension (NVTX) to enable performance
analysis through NVIDIA NSight
# Removed support for nvshmem_wait API that has been deprecated in OpenSHMEM 1.5
# Removed the NVSHMEM_IS_P2P_RUN environment variable; the runtime now
determines this automatically
# Made improvements to NVSHMEM example codes
# Added NVSHMEM_REMOTE_TRANSPORT environment variable for selecting the networking
layer used for communication between nodes
# Set maxrregcount to 32 for non-inlined device functions to ensure that calling
these NVSHMEM functions does not negatively affect kernel occupancy
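
The two environment variables introduced in this release can be set before
launching a job. A minimal sketch, assuming a POSIX shell; the team count is an
arbitrary example value, and the transport name "ucx" is an assumption that
should be checked against the NVSHMEM documentation for the installed version.

```shell
# Raise the maximum number of teams that can be created
# (64 is an arbitrary example value).
export NVSHMEM_MAX_TEAMS=64

# Select the networking layer used for communication between nodes.
# The value "ucx" is assumed here to name the experimental UCX transport.
export NVSHMEM_REMOTE_TRANSPORT=ucx
```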

===============================================================================
				Changes in 2.0.3
===============================================================================

# Added work-around to avoid deadlocks due to CUDA context resource reconfiguration
on Power systems
# Added environment variable NVSHMEM_CUDA_LIMIT_STACK_SIZE to set GPU thread stack size
on Power systems
# Use of NCCL for stream and host NVSHMEM collectives is now supported on Power systems
# Updated threading level support reported for host and stream-based APIs
to NVSHMEM_THREAD_SERIALIZED. Device-side APIs support NVSHMEM_THREAD_MULTIPLE
# Fixed a bug that could lead to incorrect behavior for atomic compare-and-swap
# Fixed an issue that was observed to lead to incorrect results when using GDRCopy

===============================================================================
				Changes in 2.0.2 EA
===============================================================================

# Added the teams and team-based collectives APIs from OpenSHMEM 1.5.
# Added support to use the NVIDIA Collective Communication Library (NCCL) for
optimized NVSHMEM host and on-stream collectives.
# Added support for RDMA over Converged Ethernet (RoCE) networks.
# Added support for PMI-2 to enable an NVSHMEM job launch with srun/SLURM.
# Added support for PMIx to enable an NVSHMEM job launch with PMIx-compatible
launchers, such as Slurm and Open MPI.
# Uniformly reformatted the perftest benchmark output.
# Added support for the putmem_signal and signal_wait_until APIs.
# Improved support for single-node environments without InfiniBand.
# Fixed a bug that occurred when large numbers of fetch atomic operations were
performed on InfiniBand.
# Improved topology awareness in NIC-to-GPU assignments for DGX A100 systems.

===============================================================================
				Changes in 1.1.3
===============================================================================

# Implements nvshmem_<type>_put_signal API from OpenSHMEM 1.5
# Adds nvshmemx_signal_op API
# Optimizes implementation of signal set operation over P2P connected GPUs
# Optimizes performance of nvshmem_fence() function
# Optimizes latency of NVSHMEM atomics API
# Fixes bug in nvshmem_ptr API
# Fixes bug in implementation of host-side strided transfer (iput, iget, etc.) API
# Fixes bug in on-stream reduction for `long long` datatype
# Fixes hang during nvshmem barrier collective operation
# Fixes __device__ nvshmem_quiet() to also do quiet on IB ops to self
# Fixes bug in fetch atomic and g implementation

===============================================================================
				Changes in 1.0.1
===============================================================================

# Combines the memory of multiple GPUs into a partitioned global address space
that’s accessed through NVSHMEM APIs.
# Includes a low-overhead, in-kernel communication API for use by GPU threads.
# Includes stream-based and CPU-initiated communication APIs.
# Supports peer-to-peer communication over NVIDIA NVLink and PCI Express, and
communication across GPU clusters over NVIDIA Mellanox® InfiniBand.
# Supports x86 and POWER9 processors.
# Is interoperable with MPI and other OpenSHMEM implementations.