ptx: Retain arrays until kernel is done executing#111
Merged
ivogabe merged 4 commits intoAccelerateHS:masterfrom Nov 26, 2025
Merged
ptx: Retain arrays until kernel is done executing#111ivogabe merged 4 commits intoAccelerateHS:masterfrom
ivogabe merged 4 commits intoAccelerateHS:masterfrom
Conversation
35991a9 to
a2f8727
Compare
The monad was originally a State monad, but only 'get' was ever used. This is because the state is actually in mutable variables (MVars and IORefs) inside the monad state. Changing the monad to a Reader monad is 1. more accurate, 2. allows parallelism, and 3. allows unlifting the LLVM monad into IO, which is helpful (see the next commit).
This fixes a bug where on my GTX 1050 Ti, adbench-gmmgrad in
accelerate-tests nondeterministically returns incorrect results in the
following way:
Correct:
(Vector (Z :. 5) [-1.9073486e-3,-1.9073486e-3,-1.9073486e-3,-1.9073486e-3,-1.9073486e-3],Matrix (Z :. 5 :. 2)
[ 0.0, 0.0,
0.0, 0.0,
0.0, 0.0,
0.0, 0.0,
0.0, 0.0],Matrix (Z :. 5 :. 3)
[ 200.13342, 200.13342, -1.0,
200.13342, 200.13342, -1.0,
200.13342, 200.13342, -1.0,
200.13342, 200.13342, -1.0,
200.13342, 200.13342, -1.0])
Produced by PTX:
(Vector (Z :. 5) [-4.5776367e-5,-4.5776367e-5,-4.5776367e-5,-4.5776367e-5,-4.5776367e-5],Matrix (Z :. 5 :. 2)
[ 6.770671, 6.770671,
6.770671, 6.770671,
6.770671, 0.0,
0.0, 0.0,
0.0, 0.0],Matrix (Z :. 5 :. 3)
[ 200.13528, 200.13528, -1.0,
200.13528, 200.13528, -1.0,
200.13528, 200.13528, -1.0,
200.13528, 200.13528, -1.0,
200.13528, 200.13528, -1.0])
The differences in the first vector are likely just floating point
reordering inaccuracies, but the second result is complete bogus.
a2f8727 to
fd3949d
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Description
The remote memory table implementation in Accelerate was not used properly in the PTX backend; the result was that arrays were marked as unused (and thus eligible for freeing or evicting from GPU memory) before the kernel that used those arrays actually ran. It seems (see below in the testing section) that this resulted in unsoundness in practice where a program returned an incorrect result without any signal or error indicating something was off.
This PR fixes this problem. However, it does so in a somewhat hacky way in order to avoid having to rearchitect the entire PTX backend. See the comment above
getCudaDevicePtrhere for more details on the design.Motivation and context
accelerate-llvm/accelerate-llvm-ptx/src/Data/Array/Accelerate/LLVM/PTX/Execute/Marshal.hs
Lines 51 to 55 in c469554
How has this been tested?
The existing full testsuite passes, as well as now (for the first time!) all 4 tests in the current version of accelerate-tests.
The only reproducer we had for this problem, despite its seeming gravity, was a very large unreadable program that gave incorrect results sometimes on my GTX 1050 Ti (which isn't even supported any more by Cuda 13). The only solace is that especially with a smaller nursery (and thus more frequent GCs) such as
+RTS -A1M, the problem reproduced very often, almost every run.I have not fully ascertained that this PR really fixes all unsoundness issues in that adbench-gmmgrad program, nor have I really ascertained that this PR is fully sound. However, I believe that the PR is indeed sound, and the adbench-gmmgrad program does reliably fail before this PR and succeed after this PR. It seems something, at least, improves with this PR.
The nice thing is that after this PR, I do not know of any reproducible unsoundness issues any more in Accelerate.
Types of changes
Checklist: