Merged PR 750781: Fix retry logic for low disk space

semihokur · narasamdya · commit 325c30c95d70 · 2023-11-09T13:45:03.000-08:00
When disk space runs low, we cancel the PipQueue and all running pips by triggering SchedulerCancellationToken. The cancelled pips end with SandboxedProcessPipExecutionStatus.Canceled. However, we do not set RetryInfo for those pips, so the Canceled sandboxed result is translated to PipResultStatus.Failed. The orchestrator then receives Failed pip results with no error logs, and logs DistributionPipFailedOnWorker errors.

We should have set RetryInfo for those cancelled pips, but we skipped it because we checked environment.Context.CancellationToken instead of SchedulerCancellationToken. Context.CancellationToken is triggered when CTRL-C is pressed, whereas SchedulerCancellationToken is triggered when we request termination in the Scheduler. This change checks environment.SchedulerCancellationToken instead.
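
The difference between the two tokens can be illustrated with a minimal, self-contained sketch. It is not the PipExecutor code: SketchOutcome, ShouldSetRetryInfo, and Translate are hypothetical names, and the Retryable outcome is an assumption standing in for whatever the real code does once RetryInfo is set. Only the two-token distinction and the "Canceled without RetryInfo becomes Failed" behavior come from the description above.

```csharp
using System;
using System.Threading;

// Hypothetical stand-in for the pip result; not a BuildXL type.
public enum SketchOutcome { Failed, Retryable }

public static class CancellationTokenSketch
{
    // Hypothetical helper: a worker attaches retry info only when the
    // token it inspects reports cancellation.
    public static bool ShouldSetRetryInfo(CancellationToken tokenToCheck, bool isWorker)
        => tokenToCheck.IsCancellationRequested && isWorker;

    // A cancelled pip without retry info is reported as Failed.
    // Retryable is an assumed label for the case where retry info is present.
    public static SketchOutcome Translate(bool hasRetryInfo)
        => hasRetryInfo ? SketchOutcome.Retryable : SketchOutcome.Failed;

    public static void Main()
    {
        // Plays the role of Context.CancellationToken (CTRL-C).
        using var contextCts = new CancellationTokenSource();
        // Plays the role of SchedulerCancellationToken (scheduler termination).
        using var schedulerCts = new CancellationTokenSource();

        // Low disk space: only the scheduler token is triggered.
        schedulerCts.Cancel();

        bool beforeFix = ShouldSetRetryInfo(contextCts.Token, isWorker: true);
        bool afterFix = ShouldSetRetryInfo(schedulerCts.Token, isWorker: true);

        Console.WriteLine($"Checking context token:   {Translate(beforeFix)}");
        Console.WriteLine($"Checking scheduler token: {Translate(afterFix)}");
    }
}
```

In this sketch, checking the context token never sees the scheduler-initiated cancellation, so the pip falls through to Failed; checking the scheduler token takes the retry path.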

Related work items: #2121638
diff --git a/Public/Src/Engine/Scheduler/PipExecutor.cs b/Public/Src/Engine/Scheduler/PipExecutor.cs
@@ -2215,7 +2215,7 @@ private static async Task<SandboxedProcessPipExecutionResult> ExecutePipAndHandl
                             expectedCommitMb: expectedMemoryCounters.PeakCommitSizeMb,
                             cancelMilliseconds: (int)(cancelTime?.TotalMilliseconds ?? 0));
                     }
-                    else if (environment.Context.CancellationToken.IsCancellationRequested
+                    else if (environment.SchedulerCancellationToken.IsCancellationRequested
                              && environment.Configuration.Distribution.BuildRole == DistributedBuildRoles.Worker)
                     {
                         // The pip was cancelled due to the scheduler terminating on this distributed worker.
