
eleonoradgr
Contributor

@eleonoradgr eleonoradgr commented Oct 10, 2025

What changed?

Logic implemented in the shard distributor when it receives a heartbeat:

  • Local passthrough: return an error, since we expect no external calls in this mode.
  • Local passthrough shadow and distributed passthrough behave the same way: check the shard assignment for the executor; if nothing changed, return it; if it changed, delete the executor and add the shards again. In these modes the namespace reconciliation loop does not run for the namespace, so shards are not reassigned while the executor is being deleted.
  • Onboarded: no changes to the normal flow.

Logic that will be implemented in a follow-up PR for the executor library:

  • Create the module that instantiates the executor with either local passthrough or communication with the shard distributor (SD).
  • The mode is assigned after the first heartbeat request (except local passthrough, which is statically assigned).
  • Local passthrough shadow: check the response against the current request and do not write it back to the internal state.
  • Distributed passthrough: before putting a new shard assignment into place, send a heartbeat and apply the assignment after receiving it back.
  • Onboarded: normal flow.
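
As a rough illustration of the executor-side plan above, the sketch below shows how a heartbeat response might be handled per mode. Everything here is an assumption, since the follow-up PR does not exist yet: `Mode`, `Executor`, `Mismatches`, and `OnHeartbeatResponse` are hypothetical names. Local passthrough is omitted because it never talks to the shard distributor.

```go
package main

import (
	"fmt"
	"sort"
)

// Mode is a hypothetical executor-side view of the namespace mode.
type Mode int

const (
	ModeLocalPassthroughShadow Mode = iota
	ModeDistributedPassthrough
	ModeOnboarded
)

// Executor holds the shards it currently works on (hypothetical type).
type Executor struct {
	Shards     []string
	Mismatches int // shadow-mode disagreements between local state and the distributor
}

// sameShardSet reports whether a and b contain the same shard IDs, ignoring order.
func sameShardSet(a, b []string) bool {
	if len(a) != len(b) {
		return false
	}
	a = append([]string(nil), a...)
	b = append([]string(nil), b...)
	sort.Strings(a)
	sort.Strings(b)
	for i := range a {
		if a[i] != b[i] {
			return false
		}
	}
	return true
}

// OnHeartbeatResponse processes the assignment returned by the distributor.
func (e *Executor) OnHeartbeatResponse(mode Mode, assigned []string) {
	switch mode {
	case ModeLocalPassthroughShadow:
		// Compare only; the internal state is never overwritten.
		if !sameShardSet(e.Shards, assigned) {
			e.Mismatches++
		}
	case ModeDistributedPassthrough, ModeOnboarded:
		// The assignment takes effect only after the distributor returns it.
		e.Shards = append([]string(nil), assigned...)
	}
}

func main() {
	shadow := &Executor{Shards: []string{"shard-a"}}
	shadow.OnHeartbeatResponse(ModeLocalPassthroughShadow, []string{"shard-b"})
	fmt.Println(shadow.Shards, shadow.Mismatches) // [shard-a] 1

	dist := &Executor{Shards: []string{"shard-a"}}
	dist.OnHeartbeatResponse(ModeDistributedPassthrough, []string{"shard-a", "shard-b"})
	fmt.Println(dist.Shards, dist.Mismatches) // [shard-a shard-b] 0
}
```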

Next-next PR
For each of the cases, create a test in the canary.

Why?

How did you test it?
Unit tests and local execution

Potential risks
The only service using it is the canary, so there is no real impact in production.

Release notes

Documentation Changes

Member

@jakobht jakobht left a comment

Looks great!
Looking at this I think we need to consider the switch to fully onboarded again.
Doing a gradual rollout of this will cause inconsistency in shard ownership, I think.

		return nil, fmt.Errorf("delete executors: %w", err)
	}
	for shard := range request.GetShardStatusReports() {
		err = h.storage.AssignShard(ctx, request.GetNamespace(), request.GetExecutorID(), shard)
Member

This is OK for now, but I'm slightly worried about transactionality. Maybe we should have an "assignShardsToExecutor" function in the storage?

Member

Since we just deleted the executor from the store, I'm concerned that this call will simply fail because the executor is no longer there.
I think a new store function that's transactional sounds good?

Contributor Author

Good points. I moved the multiple assignments into a single transaction, relying on the existing functionality of the store module.
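
For readers following this thread, the concern can be illustrated with a small sketch of an all-or-nothing "replace the executor's shards" operation. This is not the store code from the PR: `memStore`, `ReplaceExecutorShards`, and the in-memory map are hypothetical, and a mutex plus a copied map stands in for whatever transaction mechanism the real store provides.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// ErrEmptyShardID rejects malformed input before any state is touched.
var ErrEmptyShardID = errors.New("empty shard ID")

// memStore is a hypothetical in-memory store mapping shard ID -> owning executor ID.
type memStore struct {
	mu     sync.Mutex
	owners map[string]string
}

// ReplaceExecutorShards removes every shard owned by executorID and assigns the
// given shards to it as one atomic step: either all changes apply or none do.
func (s *memStore) ReplaceExecutorShards(executorID string, shards []string) error {
	s.mu.Lock()
	defer s.mu.Unlock()

	// Build the next state on a copy so a failure leaves the store untouched.
	next := make(map[string]string, len(s.owners)+len(shards))
	for shard, owner := range s.owners {
		if owner != executorID {
			next[shard] = owner
		}
	}
	for _, shard := range shards {
		if shard == "" {
			return ErrEmptyShardID
		}
		if owner, taken := next[shard]; taken && owner != executorID {
			return fmt.Errorf("shard %q already owned by %q", shard, owner)
		}
		next[shard] = executorID
	}

	s.owners = next // commit
	return nil
}

func main() {
	s := &memStore{owners: map[string]string{"shard-a": "exec-1", "shard-c": "exec-2"}}

	// Fails because shard-c belongs to exec-2; exec-1 keeps shard-a.
	err := s.ReplaceExecutorShards("exec-1", []string{"shard-b", "shard-c"})
	fmt.Println(err, s.owners["shard-a"])

	// Succeeds: shard-a is released and shard-b is assigned in one step.
	err = s.ReplaceExecutorShards("exec-1", []string{"shard-b"})
	fmt.Println(err, s.owners["shard-b"])
}
```

The point of the design is that the delete and the re-assignment are never observable separately, which avoids both the partial-failure case and the "executor is already gone" case raised above.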

		continue
	}
	if p.namespaceCfg.Mode != config.MigrationModeONBOARDED {
		p.logger.Info("Namespace not onboarded, rebalance not triggered", tag.ShardNamespace(p.namespaceCfg.Name))
Member

This might log a lot, but we can always adjust if it's too much

Contributor Author

Yes, it could be. We also already have a log statement outside this if condition, so we can consider removing both if it is too noisy.

@eleonoradgr
Contributor Author

> Looks great! Looking at this I think we need to consider the switch to fully onboarded again. Doing a gradual rollout of this will cause inconsistency in shard ownership, I think.

If we roll it out incrementally as implemented right now, yes, it will be inconsistent. But I am thinking of a different way of enabling the feature that still protects against global impact. I am writing it down in the onboarding plan.

@eleonoradgr eleonoradgr changed the title from "feat [Shard distributor]: Implementation of onboarding logic" to "feat: [Shard-Distributor] Implementation of onboarding logic" on Oct 16, 2025