Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add out-of-bounds memory checking for GPUMem buffers and sub-buffers #3666

Merged
merged 13 commits into from
Apr 8, 2025

Conversation

JonathanLichtnerAMD
Copy link
Contributor

@JonathanLichtnerAMD JonathanLichtnerAMD commented Apr 1, 2025

This PR adds two different detection mechanisms for detecting out-of-bounds GPU buffer
memory accesses:

  1. The first adds a new argument to MIOPenDriver, --gpubuffer_check , which can be used to make the GPUMem class allocate GPU buffers that are aligned to 2 MiB alignment boundaries. Specifying 1 for this option causes the client memory to be allocated at the start of the buffer (to detect OOBs before the buffer's start) and specifying 2 for this option causes the client memory to be aligned at the end of the buffer (to detect OOBs after the buffer's end). Note that this is currently enabled for convolutions and batchnorm, although it is simple to add it for other problems. (We might also want to consider turning this checking on by default since this only affects MIOpenDriver usage.)
  2. The second method detects OOBs memory accesses for sub-buffers. This can be enabled via the MIOPEN_DEBUG_CHECK_SUB_BUFFER_OOB_MEMORY_ACCESS environment variable, and uses a technique similar to that used for the GPUMem class (but adapted for sub-buffers).

This commit implements out-of-bounds (OOBs) memory detection for
MIOPen GPUMem convolution buffers.

This detection relies on the fact that the GPU will raise a memory
fault if an access into GPU buffers crosses a 2 MiB alignment
boundary *and* the memory adjacent to the boundary is not allocated, i.e.,
if the gpu memory is aligned to a 2 MiB page size boundary and the
memory access is either before the start of the buffer or after the
end of the buffer.

To make the problem more likely to occur, the memory returned to the
client is either allocated at the start of the buffer (to catch issues
before the start of the buffer) or at the end of the buffer (to catch
issues at the end of the buffer).

This detection is off by default, but can be enabled via
the MIOpenDriver --gpubuffer_check flag which takes the following
values:

    0 - no out-of-bound detection (default)
    1 - align the user-memory to the left (detect OOB before the start
        of the buffer)
    2 - align the user-memory to the right (detect OOB after the end
        of the buffer)

Note that this initial commit only enables this detection for
convolutions.
This commit implements out-of-bounds memory detection for
MIOpen sub-buffers, which are buffers created from the workspace; that
is, the workspace is chopped into logical sub-buffers that can then be
used by the kernels.

Note that the previous detection methods might not detect invalid uses
of sub-buffers, since these sub-buffers were still part of the
workspace, and so going over the edge of a sub-buffer might be hidden
by the fact that the access was *into* a different sub-buffer.  This
is still an issue, but would not cause a GPU memory fault.

This commit introduces a new environment variable,
MIOPEN_DEBUG_CHECK_SUB_BUFFERS, which can be used to enable
out-of-bound memory access detection for sub-buffers.  This is done by
hipMalloc'ing each sub-buffer (in CreateSubBuffer()) and then aligning
the sub-buffer to either the left or the right of the 2 MiB aligned
pages.

If the environment variable is undefined or set to zero, then no
sub-buffer out-of-bounds detection is done. Otherwise, setting the
environment variable to 1 aligns the buffer to the left, which will
detect out-of-bounds before the start of the sub-buffer, and setting
the environment variable to 2 aligns the buffer to the right, which
detects out-of-bounds at the end of the sub-buffer.

Note that this is intended for internal debugging only.
Copy link
Contributor

@cderb cderb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@JonathanLichtnerAMD JonathanLichtnerAMD requested a review from a team as a code owner April 2, 2025 23:59
Copy link
Contributor

@amd-jnovotny amd-jnovotny left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a couple of suggestions for the docs.

Copy link
Contributor

@averinevg averinevg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@JonathanLichtnerAMD JonathanLichtnerAMD merged commit 20ddfa7 into develop Apr 8, 2025
36 of 82 checks passed
@JonathanLichtnerAMD JonathanLichtnerAMD deleted the jlichtne/memfault-memcheck branch April 8, 2025 15:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants