Skip to content

Conversation

@surayya-MS
Copy link
Member

Fixes #12660

Context

The following assert fails:
/mt /m sometimes crashes with error:

Microsoft.Build.Framework.InternalErrorException: MSB0001: Internal MSBuild Error: Why are we trying to disconnect from a context that we already disconnected from?  Did we call DisconnectFromHost twice?
   at Microsoft.Build.Shared.ErrorUtilities.ThrowInternalError(String message, Exception innerException, Object[] args) in C:\Users\alinama\work\msbuild\src\Shared\ErrorUtilities.cs:line 69
   at Microsoft.Build.Shared.ErrorUtilities.VerifyThrow(Boolean condition, String unformattedMessage) in C:\Users\alinama\work\msbuild\src\Shared\ErrorUtilities.cs:line 195
   at Microsoft.Build.BackEnd.NodeProviderOutOfProcTaskHost.DisconnectFromHost(Int32 nodeId) in C:\Users\alinama\work\msbuild\src\Build\BackEnd\Components\Communications\NodeProviderOutOfProcTaskHost.cs:line 607
   at Microsoft.Build.BackEnd.TaskHostTask.Execute() in C:\Users\alinama\work\msbuild\src\Build\Instance\TaskFactories\TaskHostTask.cs:line 382
   at Microsoft.Build.BackEnd.TaskExecutionHost.Execute() in C:\Users\alinama\work\msbuild\src\Build\BackEnd\TaskExecutionHost\TaskExecutionHost.cs:line 636

Changes Made

Changed _nodeIdToPacketFactory and _nodeIdToPacketHandler from Dictionary to ConcurrentDictionary.

Testing

Manually tested to make sure the error is not there.

Notes

@surayya-MS surayya-MS requested review from AR-May and Copilot and removed request for Copilot October 30, 2025 11:27
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR improves thread safety in NodeProviderOutOfProcTaskHost by converting two dictionary fields from IDictionary<int, T> to ConcurrentDictionary<int, T>.

  • Changed field type declarations for _nodeIdToPacketFactory and _nodeIdToPacketHandler to use ConcurrentDictionary
  • Updated initialization from Dictionary to ConcurrentDictionary
  • Replaced Remove() calls with TryRemove() to use the concurrent dictionary API

Copy link
Member

@rainersigwald rainersigwald left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I see how this solves the problem. Can you elaborate please?

@AR-May
Copy link
Member

AR-May commented Oct 31, 2025

I'm not sure I see how this solves the problem. Can you elaborate please?

oh, that's a mystery here. This fix solves the race condition in the prototype branch - I can attest to that fact. But neither me nor @surayya-MS (please correct me if you figured it out since our last sync) figured yet why it solves it, more precisely, why it runs into a race condition in the first place.

We need to investigate this bug a bit more

@AR-May AR-May marked this pull request as draft October 31, 2025 13:54
@JanProvaznik JanProvaznik self-assigned this Nov 18, 2025
@JanProvaznik
Copy link
Member

JanProvaznik commented Nov 18, 2025

There is no race condition, it's "just" non-concurrent directory corruption by concurrent write by multiple threads (of different ids - the current design assigns different taskhost node id to each thread node), the PR correctly fixes that. The message in the assert is a red herring.

@JanProvaznik JanProvaznik marked this pull request as ready for review November 18, 2025 12:36
@JanProvaznik
Copy link
Member

lgtm, but I want another reviewer

@AR-May
Copy link
Member

AR-May commented Nov 19, 2025

Ok, now I see. The writes to the dictionaries are behind some locks. However, those locks are from different tasks objects and do not actually protect from concurrent writes when we have more than one task host task running (i.e. when we have many nodes in one process). The fix makes sense now.

@JanProvaznik
Copy link
Member

JanProvaznik commented Nov 19, 2025

note:
I believe the locks protect us from double disconnect when cancellation arrives as task finished executing in the taskhost. They were not relevant to this multiple taskhosts scenario.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Race condition in NodeProviderOutOfProcTaskHost

4 participants