Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add fence delay options at capture time #1155

Open
wants to merge 1 commit into
base: dev
Choose a base branch
from

Conversation

marius-pelegrin-arm
Copy link
Contributor

@marius-pelegrin-arm marius-pelegrin-arm commented Jun 20, 2023

Add two capture options:

  • Fence Query Delay
  • Fence Query Delay Unit
  • Fence Query Timeout Threshold

When capturing with intent to replay on a "slow" platform, we may want to force the captured application to adopt a "slow platform behavior" - for example, not re-using some resources because we know that at replay time command buffers will take a lot longer to be executed.

To emulate this, this commit implement the notion of "fence query delay". The idea is that when the captured application queries a VkFence using vkGetFenceStatus or vkWaitForFences with timeout under the "timeout threshold", we transmit the call to the driver, but if the fence is ready, we do not return VK_SUCCESS directly. Instead, we return VK_NOT_READY for "a certain amount of time" specified by the "delay" and "delay unit" capture options.

By default this amount of time is 0 call, which means no delay at all: we give to the application the result from the driver directly. This amount of time can be specified in number of calls or in number of frames. The unit (calls/frames) is specified by fence query delay unit, and the number by fence query delay.

In addition, this commit also fixes the --sgfs and --sgfr options that were not parsed correctly, and the behavior of vkGetFenceStatus and vkWaitForFences at replay time to not "spam" vkGetFenceStatus
if the call was successful at capture time but not at replay time.

@ci-tester-lunarg
Copy link

Author arm-marius-pelegrin not on autobuild list. Waiting for curator authorization before starting CI build.

@CLAassistant
Copy link

CLAassistant commented Jun 20, 2023

CLA assistant check
All committers have signed the CLA.

@ci-tester-lunarg
Copy link

CI gfxreconstruct build queued with queue ID 20199.

@ci-tester-lunarg
Copy link

CI gfxreconstruct build # 2873 running.

@ci-tester-lunarg
Copy link

CI gfxreconstruct build # 2873 passed.

@bradgrantham-lunarg bradgrantham-lunarg added the P2 A high-priority code maintenance issue or a functional problem that is recoverable or not a crash. label Oct 30, 2023
@charles-lunarg
Copy link
Contributor

Less of a question of implementation and more of design.

The purpose of the fence delay is to force the application being recorded to issue multiple vkGetFenceStatus calls, that way during replay if a single vkGetFenceStatus call returns VK_NOT_READY, the subsequent ones may succeed?

This feels like a non-deterministic solution because there is no guarantee that the number of queries is sufficient, and because it is encoded directly into the capture, it doesn't have the ability to arbitrarily repeat calls vkGetFenceStatus.

Looking at VulkanReplayConsumerBase::OverrideGetFenceStatus, I see that the replay will retry calling vkGetFenceStatus if the return code isn't VK_SUCCESS or VK_NOT_READY.

Would it make more sense to loop until the return code is VK_SUCCESS with a possible "max retry count"? My thought is to not modify the capture file at all, and rather make replay handle slow devices more gracefully.

@per-mathisen-arm
Copy link
Contributor

Some applications use vkGetFenceStatus to query when a resource can be reused. If this reuse happens in the same frame as the original usage, then on replay the resource might not yet be ready for reuse by the time we get to this point in time again. As you point out, the replayer will in this case wait for the call to return VK_SUCCESS, and that is normally fine. However, on slower platforms, this wait changes the workload dramatically, and makes it unusable for performance investigations. The purpose of this patch is to try to defeat such over-eager resource reuse - if the app does a one-off status check it will be rebuffed, while if it does a busy loop until ready it will quickly bypass this mechanism thus avoiding hangs.

@bradgrantham-lunarg
Copy link
Contributor

@marius-pelegrin-arm @per-mathisen-arm I'm reluctant to merge this to dev. One of our principles is to store the application's Vulkan calls as close to verbatim as possible. If someone runs convert on the output of capturing with this PR with this option enabled, the recorded commands will look like the capture had to Get or Wait a bunch of times on the captured GPU, but that wasn't actually the case.

We do modify the stream if necessary, like for #1160. There would otherwise be missing information from the capture that would prevent replay. We have attempted otherwise to store the app's calls verbatim, and then (as you probably know) add additional query data or system information as Metadata Commands.

I don't really understand how the situation describing makes the replay unusable for performance investigations because on replay it looks like we don't create more Waits or Gets. If you mean wall-clock performance, we don't make any statements about that anyway. We do try to make the replay GPU workload (e.g. shader core operations) as similar as possible to the captured workload - is replay not doing that for Fences? I would consider that a bug.

Another possibility is to just add a delay of some kind in replay before calling Wait or Get if that's important on a very slow emulation architecture. It's much more acceptable to alter the stream on replay, since that's after recording the original application's stream in the file, and we know replay is used on wildly different environments from capture.

If you really want to force the app to perform multiple Query or Wait calls in anticipation of running on a GPU emulation or simulation environment which has a very different performance level from the CPU, I think what you're discussing feels like an emulation to apply to the GPU being captured.

If a user wanted to limit a GPU's extensions or features, we'd tell that user to use the Profiles layer to do that, rather than implementing in GFXR capture layer. I think this is the same category of functionality. I think if you want to force timing behavior dependent on CPU code it's better suited to a layer in which you can specify things like delays. @per-mathisen-arm we can talk directly about that if desired.

@charles-lunarg
Copy link
Contributor

Wanted to add a note that this PR would need to be amended to support the changes made in #1265, namely that any capture options added by this PR need to be accounted for in the logic that keeps track of non-default capture options.

@ci-tester-lunarg
Copy link

CI gfxreconstruct build queued with queue ID 122282.

@ci-tester-lunarg
Copy link

CI gfxreconstruct build # 3690 running.

@ci-tester-lunarg
Copy link

CI gfxreconstruct build queued with queue ID 122320.

@ci-tester-lunarg
Copy link

CI gfxreconstruct build # 3692 running.

@ci-tester-lunarg
Copy link

CI gfxreconstruct build # 3692 failed.

@bradgrantham-lunarg
Copy link
Contributor

bradgrantham-lunarg commented Jan 25, 2024

3692 appears to have been an internal CI machine failure and will be re-run.

@ci-tester-lunarg
Copy link

CI gfxreconstruct build queued with queue ID 122458.

@ci-tester-lunarg
Copy link

CI gfxreconstruct build # 3695 running.

@ci-tester-lunarg
Copy link

CI gfxreconstruct build # 3695 passed.

@ci-tester-lunarg
Copy link

CI gfxreconstruct build queued with queue ID 258768.

@ci-tester-lunarg
Copy link

CI gfxreconstruct build # 4852 running.

@ci-tester-lunarg
Copy link

CI gfxreconstruct build queued with queue ID 258783.

@ci-tester-lunarg
Copy link

CI gfxreconstruct build # 4853 running.

@ci-tester-lunarg
Copy link

CI gfxreconstruct build queued with queue ID 258801.

@ci-tester-lunarg
Copy link

CI gfxreconstruct build # 4854 running.

@ci-tester-lunarg
Copy link

CI gfxreconstruct build # 4854 failed.

@lunarpapillo
Copy link
Contributor

The LunarG CI failures seem to be due to a crashed Pixel device. Restarting.

@ci-tester-lunarg
Copy link

CI gfxreconstruct build queued with queue ID 259009.

@ci-tester-lunarg
Copy link

CI gfxreconstruct build # 4856 running.

@ci-tester-lunarg
Copy link

CI gfxreconstruct build # 4856 passed.

@ci-tester-lunarg
Copy link

CI gfxreconstruct build queued with queue ID 259632.

@ci-tester-lunarg
Copy link

CI gfxreconstruct build # 4865 running.

@ci-tester-lunarg
Copy link

CI gfxreconstruct build # 4865 failed.

@lunarpapillo
Copy link
Contributor

Last LunarG CI failure seemed to be due to X server crash. Re-running.

@ci-tester-lunarg
Copy link

CI gfxreconstruct build queued with queue ID 260150.

@ci-tester-lunarg
Copy link

CI gfxreconstruct build # 4874 running.

@ci-tester-lunarg
Copy link

CI gfxreconstruct build # 4874 passed.

@ci-tester-lunarg
Copy link

CI gfxreconstruct build queued with queue ID 336319.

@ci-tester-lunarg
Copy link

CI gfxreconstruct build # 5682 running.

@ci-tester-lunarg
Copy link

CI gfxreconstruct build queued with queue ID 336336.

@ci-tester-lunarg
Copy link

CI gfxreconstruct build # 5683 running.

Add three capture options:
- Fence Query Delay
- Fence Query Delay Unit
- Fence Query Timeout Threshold

When capturing with intent to replay on a "slow" platform, we may
want to force the captured application to adopt a "slow platform
behavior" - for example, not re-using some resources because we
know that at replay time command buffers will take a lot longer
to be executed.

To emulate this, this commit implement the notion of "fence query
delay". The idea is that when the captured application queries a
`VkFence` using `vkGetFenceStatus` or `vkWaitForFences` with `timeout`
under the "timeout threshold", we transmit the call to the driver,
but if the fence is ready, we do not return `VK_SUCCESS` directly.
Instead, we return `VK_NOT_READY` for "a certain amount of time"
specified by the "delay" and "delay unit" capture options.

By default this amount of time is `0 call`, which means no delay at
all: we give to the application the result from the driver directly.
This amount of time can be specified in number of `calls` or in
number of `frames`. The unit (calls/frames) is specified by fence
query delay unit, and the number by fence query delay.

In addition, this commit also fixes the `--sgfs` and `--sgfr` options
that were not parsed correctly, and the behavior of `vkGetFenceStatus`
and `vkWaitForFences` at replay time to not "spam" `vkGetFenceStatus`
if the call was successful at capture time but not at replay time.

Change-Id: I9adff58c364b6a08fde2a95502e3b79152e1cbdf
@ci-tester-lunarg
Copy link

CI gfxreconstruct build queued with queue ID 336356.

@ci-tester-lunarg
Copy link

CI gfxreconstruct build # 5684 running.

@ci-tester-lunarg
Copy link

CI gfxreconstruct build # 5684 passed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
P2 A high-priority code maintenance issue or a functional problem that is recoverable or not a crash.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants