From 7d5434989e75262a7b681d6a9699f05f98e315a0 Mon Sep 17 00:00:00 2001
From: Ben Wibking
Date: Fri, 26 Sep 2025 15:51:18 -0400
Subject: [PATCH 1/2] add docs about floating-point reproducibility

---
 .../source/Nondeterminism.rst              | 88 +++++++++++++++++++
 Docs/sphinx_documentation/source/index.rst |  1 +
 2 files changed, 89 insertions(+)
 create mode 100644 Docs/sphinx_documentation/source/Nondeterminism.rst

diff --git a/Docs/sphinx_documentation/source/Nondeterminism.rst b/Docs/sphinx_documentation/source/Nondeterminism.rst
new file mode 100644
index 00000000000..498231f9cc5
--- /dev/null
+++ b/Docs/sphinx_documentation/source/Nondeterminism.rst
@@ -0,0 +1,88 @@
.. _sec:nondeterminism:

Floating-Point Nondeterminism
=============================

AMReX executes most kernels in parallel on CPUs, GPUs, and across MPI ranks.
Because floating-point addition and multiplication are not associative, the
final rounding depends on the precise order in which partial updates are
executed. This section summarizes the places in the AMReX source that rely on
atomics, reductions, or collective communication whose scheduling can vary from
run to run, producing small changes in floating-point answers even when the
inputs are identical.

Atomic accumulations in FAB-based operations
--------------------------------------------

* ``BaseFab::atomicAdd`` and related helpers allow multiple tiles to accumulate
  into the same FAB in parallel (:file:`Src/Base/AMReX_BaseFab.H`). The GPU
  path goes through ``amrex::Gpu::Atomic::AddNoRet``, while the host path uses
  ``#pragma omp atomic`` in :file:`Src/Base/AMReX_GpuAtomic.H`. The
  interleaving of those atomic updates depends on the thread and block
  schedule, so repeated runs can round the total differently.
* ``FabArray`` communication utilities rely on the same atomics when unpacking
  overlapping data. For example, ``FabArray::ParallelCopy`` calls
  ``Gpu::Atomic::AddNoRet`` while combining receive buffers in
  :file:`Src/Base/AMReX_FBI.H`. When multiple senders contribute to the same
  cell, the order of those additions is nondeterministic.
* Embedded-boundary redistribution routines, such as
  ``EB_StateRedistribute`` and ``EB_StateRedistUtils``, update conservative
  quantities with GPU atomics (:file:`Src/EB/AMReX_EB_StateRedistribute.cpp`,
  :file:`Src/EB/AMReX_EB_StateRedistUtils.cpp`). The choice of CUDA/HIP warp
  scheduling determines the summation order.

Particle-to-mesh deposition
---------------------------

* Particle interpolation and charge-deposition kernels (e.g.,
  ``amrex_deposit_cic``) perform ``Gpu::Atomic::AddNoRet`` into nodal or
  cell-centered fields (:file:`Src/Particle/AMReX_Particle_mod_K.H`). The host
  path falls back to ``BaseFab::atomicAdd``
  (:file:`Src/Particle/AMReX_ParticleContainerI.H`,
  :file:`Src/Particle/AMReX_ParticleMesh.H`). Because particles are processed
  in parallel, the sequence of adds depends on the launch configuration or on
  OpenMP scheduling.
* Higher-order particle-mesh operators use the same atomics when depositing
  through wider stencils: ``ParticleInterpolators::ParticleToMesh`` invokes
  ``Gpu::Atomic::AddNoRet`` for each weighted contribution
  (:file:`Src/Particle/AMReX_ParticleInterpolators.H`). Communication stages
  reuse the same buffers, so overlapping deposits inherit the nondeterministic
  ordering. The sketch below illustrates the underlying pattern.
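
To make the failure mode concrete, the following sketch shows the shape of an
atomic deposition kernel. It is illustrative rather than actual AMReX source:
the function name ``deposit`` and the flattened index and weight arrays are
hypothetical, while ``amrex::ParallelFor``, ``amrex::Array4``, and
``amrex::Gpu::Atomic::AddNoRet`` are the real entry points named above.

.. code-block:: cpp

   #include <AMReX_Gpu.H>
   #include <AMReX_Array4.H>

   // Hypothetical deposition kernel: particle ip adds its weight w[ip] into
   // cell (i[ip], j[ip], k[ip]).  Several particles may target the same cell;
   // the atomic makes the race safe, but the order in which the adds land is
   // up to the hardware scheduler, so the rounding can change run to run.
   void deposit (long np,
                 int const* i, int const* j, int const* k,
                 amrex::Real const* w,
                 amrex::Array4<amrex::Real> const& rho)
   {
       amrex::ParallelFor(np, [=] AMREX_GPU_DEVICE (long ip) noexcept
       {
           amrex::Gpu::Atomic::AddNoRet(rho.ptr(i[ip], j[ip], k[ip]), w[ip]);
       });
   }

Nothing here specifies which particle's contribution arrives first when two
particles share a cell, and that freedom is exactly what lets the rounded
result vary between otherwise identical runs.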

Parallel reductions and norms
-----------------------------

* ``ReduceOps`` / ``ReduceData`` implement parallel norms and sums for
  FAB-based data (:file:`Src/Base/AMReX_Reduce.H`,
  :file:`Src/Base/AMReX_ParReduce.H`). Each reducer combines per-thread and
  per-block partial results in an order that depends on the execution backend
  (CUDA blocks, HIP wavefronts, or OpenMP threads). Calls such as
  ``MultiFab::sum``, ``MultiFab::norm0``, and particle reductions inherit that
  behavior.
* ``HostDevice::Atomic::Add`` is used internally when reductions fall back to a
  host implementation, again relying on ``#pragma omp atomic`` in
  :file:`Src/Base/AMReX_GpuAtomic.H`.

MPI collective reductions
-------------------------

* ``ParallelDescriptor::ReduceReal*`` functions use ``MPI_Allreduce`` or
  ``MPI_Reduce`` (:file:`Src/Base/AMReX_ParallelDescriptor.H`). MPI libraries
  are free to reorder partial sums while building their reduction trees, and
  the order can change with communicator topology, process pinning, or MPI
  version. Cross-rank sums and norms therefore vary at the level of machine
  epsilon even when each rank performs a deterministic local reduction
  (max/min reductions, by contrast, are insensitive to ordering).

Practices for reproducibility
-----------------------------

* Fix the execution configuration: run with a single MPI rank or a fixed MPI
  decomposition, disable OpenMP when possible, and use deterministic GPU launch
  parameters to reduce (though not eliminate) variability from atomics.
* Prefer deterministic algorithms in application code where bitwise
  reproducibility is required: a fixed (e.g., serial) accumulation order gives
  bit-for-bit results, and Kahan/compensated summation shrinks the effect of
  whatever ordering remains (see the sketch after this list).
* Capture reference answers with the same build options, compiler, and hardware
  whenever regression tests depend on floating-point tolerances.
* Document the expected tolerance in tests and diagnostics; accept that atomic
  updates can differ by several ulps unless a reproducible algorithm is used.
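
For the second bullet, compensated (Kahan) summation is a small,
self-contained technique; the sketch below is generic C++ rather than an AMReX
API. The compensation term recovers the low-order bits lost in each addition,
which makes the result far less sensitive to the order of the terms. Note that
value-unsafe optimization flags such as ``-ffast-math`` may legally eliminate
the compensation, so such code must be compiled with value-safe floating-point
settings.

.. code-block:: cpp

   #include <cstddef>

   // Compensated (Kahan) summation.  c accumulates the rounding error of each
   // add and feeds it back into the next term, so the running sum loses far
   // fewer low-order bits than a naive accumulation.
   double kahan_sum (double const* x, std::size_t n)
   {
       double sum = 0.0;
       double c   = 0.0;            // running compensation
       for (std::size_t i = 0; i < n; ++i) {
           double y = x[i] - c;     // re-inject the previous rounding error
           double t = sum + y;      // low bits of y are lost here ...
           c = (t - sum) - y;       // ... and recovered into c
           sum = t;
       }
       return sum;
   }

Compensation reduces how much the rounded result depends on summation order;
bitwise reproducibility additionally requires combining the terms in a fixed
order, as a serial accumulation does.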

diff --git a/Docs/sphinx_documentation/source/index.rst b/Docs/sphinx_documentation/source/index.rst
index 09ffbb5c0b4..794ee3f4d6b 100644
--- a/Docs/sphinx_documentation/source/index.rst
+++ b/Docs/sphinx_documentation/source/index.rst
@@ -58,6 +58,7 @@ Documentation on migration from BoxLib is available in the AMReX repository at D
    Visualization_Chapter
    Post_Processing_Chapter
    Debugging
+   Nondeterminism
    RuntimeParameters
    AMReX_Profiling_Tools_Chapter
    External_Profiling_Tools_Chapter

From ad1a072b77c5532877e86699f773f8042a782a2e Mon Sep 17 00:00:00 2001
From: Ben Wibking
Date: Fri, 26 Sep 2025 15:57:23 -0400
Subject: [PATCH 2/2] mention flux registers

---
 .../source/Nondeterminism.rst | 20 +++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/Docs/sphinx_documentation/source/Nondeterminism.rst b/Docs/sphinx_documentation/source/Nondeterminism.rst
index 498231f9cc5..e472e30dcb2 100644
--- a/Docs/sphinx_documentation/source/Nondeterminism.rst
+++ b/Docs/sphinx_documentation/source/Nondeterminism.rst
@@ -32,6 +32,26 @@ Atomic accumulations in FAB-based operations
:file:`Src/EB/AMReX_EB_StateRedistUtils.cpp`). The choice of CUDA/HIP warp
scheduling determines the summation order.

AMR flux registers
------------------

* ``YAFluxRegister`` employs ``HostDevice::Atomic::Add`` while accumulating
  fine fluxes back onto coarse faces
  (:file:`Src/Boundary/AMReX_YAFluxRegister_3D_K.H`). The atomic updates allow
  multiple fine patches to update the same coarse cell concurrently, so the
  final ordering depends on GPU thread/block scheduling or OpenMP
  interleaving. ``EBFluxRegister`` inherits this implementation and introduces
  additional atomic accumulations when redistributing embedded-boundary fluxes
  (:file:`Src/EB/AMReX_EBFluxRegister_3D_C.H`).
* The legacy ``FluxRegister`` implementation coarsens fine fluxes inside a
  single thread before writing to the coarse register, avoiding atomic
  operations (:file:`Src/AmrCore/AMReX_FluxReg_3D_C.H`). Likewise,
  ``EdgeFluxRegister`` (used for constrained-transport electric fields) loops
  over all fine contributions within the thread that owns the coarse edge, so
  its updates are deterministic (:file:`Src/Boundary/AMReX_EdgeFluxRegister.cpp`).
  The Fortran-facing ``FlashFluxRegister`` follows the same pattern and keeps
  its reductions local to the thread that owns the coarse face
  (:file:`Src/F_Interfaces/AmrCore/AMReX_FlashFluxRegister.H`). A sketch of
  this owner-computes pattern follows below.

Particle-to-mesh deposition
---------------------------
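
To make the contrast with the atomic approach explicit, here is a hypothetical
owner-computes coarsening kernel in the spirit of the deterministic
``FluxRegister`` pattern described in the second patch. It is a simplified
sketch, not actual AMReX source: the real kernels in
:file:`Src/AmrCore/AMReX_FluxReg_3D_C.H` also handle face orientation, area
weights, and scaling, and this version refines only the i and j directions for
brevity.

.. code-block:: cpp

   #include <AMReX_Gpu.H>
   #include <AMReX_Box.H>
   #include <AMReX_Array4.H>

   // Hypothetical owner-computes reduction: one thread per coarse cell sums
   // the ratio x ratio block of fine values that covers it, in a fixed order,
   // then performs a single non-atomic update of the coarse array.
   void coarsen_deterministic (amrex::Box const& crse_box, int ratio,
                               amrex::Array4<amrex::Real const> const& fine,
                               amrex::Array4<amrex::Real> const& crse)
   {
       amrex::ParallelFor(crse_box,
       [=] AMREX_GPU_DEVICE (int i, int j, int k) noexcept
       {
           amrex::Real sum = 0.0;
           for (int jj = 0; jj < ratio; ++jj) {     // fixed, serial order
               for (int ii = 0; ii < ratio; ++ii) { // within the owner thread
                   sum += fine(ratio*i + ii, ratio*j + jj, k);
               }
           }
           crse(i,j,k) += sum; // single writer: no atomic, no race
       });
   }

Because every coarse cell has exactly one writer and a fixed summation order,
two runs with the same inputs produce bitwise-identical results, at the cost of
threads owning unequal amounts of work where fine coverage is uneven.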