
feat: graceful shutdown #1977 #3235

Open
Oxidaner wants to merge 10 commits into apache:develop from Oxidaner:feat/graceful_shutdown

Conversation

Contributor

@Oxidaner Oxidaner commented Mar 8, 2026

What kind of change does this PR introduce?

  • Feature

What is the current behavior?

Related Issue: #1977

What is the new behavior?

This PR enhances graceful shutdown with the following features:

  1. Exponential backoff retry for active notification
     • When sending a notification to consumers fails, the server retries with exponential backoff
     • Default: 500ms → 1s → 2s, max 3 retries
  2. Active awareness via connection error detection
     • The client detects connection errors (EOF, broken pipe, gRPC closing, http2 closing)
     • It marks the invoker as closing to avoid routing requests to unavailable instances
  3. Closing flag in response
     • The server adds a closing flag to the response attachment
     • The client checks this flag to detect the closing state
  4. Protocol-level graceful shutdown callback
     • Added SetGracefulShutdownCallback for protocol-specific shutdown logic
     • The gRPC protocol uses this for GracefulStop()

@Oxidaner Oxidaner changed the title Feat/graceful shutdown Feat: graceful shutdown Mar 8, 2026
@Oxidaner Oxidaner changed the title Feat: graceful shutdown feat: graceful shutdown Mar 8, 2026

codecov-commenter commented Mar 8, 2026

Codecov Report

❌ Patch coverage is 3.52113% with 137 lines in your changes missing coverage. Please review.
✅ Project coverage is 47.75%. Comparing base (60d1c2a) to head (9ef55e6).
⚠️ Report is 750 commits behind head on develop.

Files with missing lines                      Patch %   Lines
filter/graceful_shutdown/consumer_filter.go   5.17%     53 Missing and 2 partials ⚠️
graceful_shutdown/shutdown.go                 0.00%     42 Missing ⚠️
protocol/triple/triple.go                     0.00%     12 Missing and 1 partial ⚠️
protocol/grpc/grpc_protocol.go                0.00%     11 Missing and 1 partial ⚠️
common/extension/graceful_shutdown.go         0.00%     7 Missing ⚠️
filter/graceful_shutdown/provider_filter.go   0.00%     6 Missing ⚠️
protocol/base/base_invoker.go                 0.00%     2 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #3235      +/-   ##
===========================================
+ Coverage    46.76%   47.75%   +0.98%     
===========================================
  Files          295      463     +168     
  Lines        17172    33979   +16807     
===========================================
+ Hits          8031    16227    +8196     
- Misses        8287    16436    +8149     
- Partials       854     1316     +462     

@Oxidaner Oxidaner changed the title feat: graceful shutdown feat: graceful shutdown # 1977 Mar 8, 2026
@Oxidaner Oxidaner changed the title feat: graceful shutdown # 1977 feat: graceful shutdown #1977 Mar 8, 2026
@Alanxtl Alanxtl linked an issue Mar 8, 2026 that may be closed by this pull request
@Alanxtl Alanxtl added ✏️ Feature 3.3.2 version 3.3.2 labels Mar 8, 2026
Comment on lines -87 to -109
// those signals' original behavior is exit with dump ths stack, so we try to keep the behavior
for _, dumpSignal := range DumpHeapShutdownSignals {
if sig == dumpSignal {
debug.WriteHeapDump(os.Stdout.Fd())
}
}
os.Exit(0)

}()
}
}

// BeforeShutdown provides processing flow before shutdown
func BeforeShutdown() {
    destroyAllRegistries()
    // waiting for a short time so that the clients have enough time to get the notification that server shutdowns
    // The value of configuration depends on how long the clients will get notification.
    waitAndAcceptNewRequests()

    // reject sending/receiving the new request, but keeping waiting for accepting requests
    waitForSendingAndReceivingRequests()

    // destroy all protocols
Contributor

do not delete these comments

Contributor Author

Sorry, I will restore the comments

sonarqubecloud bot commented Mar 8, 2026

Comment on lines +58 to +60
var ok bool
tripleProtocol = tp.(*TripleProtocol)
if !ok {
Contributor

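The type assertion is wrong: tripleProtocol = tp.(*TripleProtocol) does not capture ok, and var ok bool is declared but never assigned, so it is always false. As a result, if !ok { return nil } always holds, the callback always returns early, and grpc.GracefulStop() is never executed. Graceful shutdown for triple is completely ineffective.

Suggested change: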

tripleProtocol, ok := tp.(*TripleProtocol)
if !ok {
    return nil
}
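
To make the difference concrete, here is a minimal, self-contained sketch of the two forms. The Protocol interface, the TripleProtocol stand-in, and both function names are hypothetical placeholders for illustration, not the dubbo-go types.

```go
package main

import "fmt"

// Protocol and TripleProtocol are hypothetical stand-ins for illustration only.
type Protocol interface{ Name() string }

type TripleProtocol struct{}

func (*TripleProtocol) Name() string { return "tri" }

// brokenLookup mirrors the reviewed pattern: ok is declared but never
// assigned, so it keeps its zero value (false) and nil is always returned.
func brokenLookup(p Protocol) *TripleProtocol {
	var ok bool
	tp := p.(*TripleProtocol) // single-value form: never sets ok, panics on mismatch
	if !ok {
		return nil
	}
	return tp
}

// fixedLookup uses the comma-ok form, so ok reflects the assertion result.
func fixedLookup(p Protocol) *TripleProtocol {
	tp, ok := p.(*TripleProtocol)
	if !ok {
		return nil
	}
	return tp
}

func main() {
	p := Protocol(&TripleProtocol{})
	fmt.Println(brokenLookup(p) == nil) // true: the early return always fires
	fmt.Println(fixedLookup(p) == nil)  // false: the assertion succeeded
}
```

Note that the single-value form also panics when the dynamic type does not match, which the comma-ok form avoids.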

Contributor Author

Sorry, I'll correct it

Comment on lines +46 to +58
func SetGracefulShutdownCallback(name string, f GracefulShutdownCallback) {
    gracefulShutdownCallbacks[name] = f
}

// GetGracefulShutdownCallback returns protocol's graceful shutdown callback
func GetGracefulShutdownCallback(name string) (GracefulShutdownCallback, bool) {
    f, ok := gracefulShutdownCallbacks[name]
    return f, ok
}

// GetAllGracefulShutdownCallbacks returns all protocol's graceful shutdown callbacks
func GetAllGracefulShutdownCallbacks() map[string]GracefulShutdownCallback {
    return gracefulShutdownCallbacks
Contributor

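gracefulShutdownCallbacks is a global map, and SetGracefulShutdownCallback and GetAllGracefulShutdownCallbacks have no lock protection at all, so the Go race detector will report it.

GetAllGracefulShutdownCallbacks returns the internal map reference directly; if a caller modifies it concurrently, the program will crash.

Suggestion: protect the map with a sync.RWMutex, and have GetAllGracefulShutdownCallbacks return a copy instead of the original map.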
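
One way the locking and copy-on-read suggestion could look is sketched below. The callbackRegistry type, its method names, and the func() error callback signature are assumptions made for this example; the PR's actual types may differ.

```go
package main

import (
	"fmt"
	"sync"
)

// GracefulShutdownCallback's signature is an assumption for this sketch.
type GracefulShutdownCallback func() error

// callbackRegistry is a hypothetical wrapper that guards the map with a lock.
type callbackRegistry struct {
	mu        sync.RWMutex
	callbacks map[string]GracefulShutdownCallback
}

func newCallbackRegistry() *callbackRegistry {
	return &callbackRegistry{callbacks: make(map[string]GracefulShutdownCallback)}
}

// Set stores a callback under the write lock.
func (r *callbackRegistry) Set(name string, f GracefulShutdownCallback) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.callbacks[name] = f
}

// Get looks up a callback under the read lock.
func (r *callbackRegistry) Get(name string) (GracefulShutdownCallback, bool) {
	r.mu.RLock()
	defer r.mu.RUnlock()
	f, ok := r.callbacks[name]
	return f, ok
}

// All returns a copy so callers cannot mutate the internal map.
func (r *callbackRegistry) All() map[string]GracefulShutdownCallback {
	r.mu.RLock()
	defer r.mu.RUnlock()
	out := make(map[string]GracefulShutdownCallback, len(r.callbacks))
	for k, v := range r.callbacks {
		out[k] = v
	}
	return out
}

func main() {
	r := newCallbackRegistry()
	r.Set("grpc", func() error { return nil })
	snapshot := r.All()
	delete(snapshot, "grpc") // only the copy changes
	_, ok := r.Get("grpc")
	fmt.Println(ok) // true
}
```

Returning a snapshot costs one allocation per call, which is usually acceptable for a function that runs only during shutdown.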

Contributor Author

Sorry, I'll correct it

func notifyLongConnectionConsumers() {
    logger.Info("Graceful shutdown --- Notify long connection consumers.")

    notifyTimeout := 3 * time.Second
Contributor

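notifyTimeout is hard-coded to 3s, which is completely disconnected from the existing configuration such as ShutdownConfig's StepTimeout and ConsumerUpdateWaitTime, so it cannot be controlled through configuration.

Suggestion: change the function signature to accept shutdown *global.ShutdownConfig and reuse the existing timeout fields, or add a new NotifyTimeout field.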
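
A rough sketch of reading the timeout from configuration with a fallback follows. The shutdownConfig struct, the NotifyTimeout field stored as a duration string, and the helper name are all hypothetical; they only illustrate the proposed shape and are not the real global.ShutdownConfig.

```go
package main

import (
	"fmt"
	"time"
)

// shutdownConfig is a hypothetical stand-in. NotifyTimeout is the new field
// proposed in the review, not an existing option.
type shutdownConfig struct {
	NotifyTimeout string // duration string such as "5s" (assumed format)
}

const defaultNotifyTimeout = 3 * time.Second

// notifyTimeout returns the configured timeout and falls back to the default
// when the config is nil or the field is empty, unparsable, or not positive.
func notifyTimeout(cfg *shutdownConfig) time.Duration {
	if cfg == nil || cfg.NotifyTimeout == "" {
		return defaultNotifyTimeout
	}
	d, err := time.ParseDuration(cfg.NotifyTimeout)
	if err != nil || d <= 0 {
		return defaultNotifyTimeout
	}
	return d
}

func main() {
	fmt.Println(notifyTimeout(nil))                                   // 3s
	fmt.Println(notifyTimeout(&shutdownConfig{NotifyTimeout: "10s"})) // 10s
}
```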

if attempt < maxRetries {
    delay := baseDelay
    for i := 0; i < attempt; i++ {
        delay *= 2
Contributor

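The total wait across three retries is 500ms + 1s + 2s = 3.5s, which exceeds the outer context timeout of 3s, so in practice only 1-2 retries can complete. The configuration contradicts itself.

Also, go.mod already has a github.com/cenkalti/backoff/v4 dependency, so there is no need to hand-write the exponential backoff logic; using the library directly is more reliable.

Suggestion: either increase notifyTimeout (to at least 10s) or reduce the retry count or delays; switch the backoff logic to cenkalti/backoff/v4.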
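
The arithmetic in this comment can be checked with a small standard-library sketch. The helper names are invented for illustration, and the calculation ignores the time spent inside each notification attempt, so real budgets would be tighter still.

```go
package main

import (
	"fmt"
	"time"
)

// backoffDelay returns base * 2^attempt, the wait after a failed attempt
// (attempt is zero-based).
func backoffDelay(base time.Duration, attempt int) time.Duration {
	return base << uint(attempt)
}

// totalBackoff sums the waits for the given number of retries.
func totalBackoff(base time.Duration, retries int) time.Duration {
	var total time.Duration
	for i := 0; i < retries; i++ {
		total += backoffDelay(base, i)
	}
	return total
}

// retriesWithin reports how many backoff waits fit inside the timeout,
// ignoring the duration of the attempts themselves.
func retriesWithin(base, timeout time.Duration, maxRetries int) int {
	var elapsed time.Duration
	for i := 0; i < maxRetries; i++ {
		elapsed += backoffDelay(base, i)
		if elapsed > timeout {
			return i
		}
	}
	return maxRetries
}

func main() {
	base, timeout := 500*time.Millisecond, 3*time.Second
	fmt.Println(totalBackoff(base, 3))           // 3.5s
	fmt.Println(retriesWithin(base, timeout, 3)) // 2
}
```

With a 10s timeout all three waits fit, which matches the first option in the suggestion.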

Comment on lines +179 to +180
isConnectionError := strings.Contains(errMsg, "client has closed") ||
    strings.Contains(errMsg, "connection") ||
Contributor

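strings.Contains(errMsg, "connection") is too broad. Any business error containing the word "connection" (such as "database connection pool exhausted" or "connection refused by business logic") will be misclassified as a connection-closed error, so a healthy invoker is marked as closing and SetAvailable(false) is called, and requests keep being rejected until the 30s expiry.

Suggestion: match known error types with errors.Is, or narrow the string matching to specific errors such as "connection reset" and "use of closed network connection".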
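
As a hedged illustration of the errors.Is approach, the sketch below matches against a small set of standard-library sentinel errors. The function name, the chosen set of sentinels, and the two sample-error helpers are assumptions for this example. Errors that arrive as gRPC status values may not wrap these sentinels, so a real filter would probably need an additional check for them.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"net"
	"syscall"
)

// isConnectionClosed reports whether err wraps one of a few known
// connection-level errors. The set chosen here is an assumption.
func isConnectionClosed(err error) bool {
	if err == nil {
		return false
	}
	return errors.Is(err, io.EOF) ||
		errors.Is(err, io.ErrUnexpectedEOF) ||
		errors.Is(err, net.ErrClosed) ||
		errors.Is(err, syscall.EPIPE) ||
		errors.Is(err, syscall.ECONNRESET)
}

// sampleBusinessErr is a business error that merely mentions "connection".
func sampleBusinessErr() error {
	return errors.New("database connection pool exhausted")
}

// sampleResetErr wraps a connection reset the way network code often does.
func sampleResetErr() error {
	return fmt.Errorf("call failed: %w", &net.OpError{Op: "read", Err: syscall.ECONNRESET})
}

func main() {
	fmt.Println(isConnectionClosed(sampleBusinessErr())) // false
	fmt.Println(isConnectionClosed(sampleResetErr()))    // true
}
```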

Contributor Author

Sorry, I'll correct it

if time.Now().Before(expireTime.(time.Time)) {
    return true
}
f.closingInvokers.Delete(key)
Contributor

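markClosingInvoker calls bi.SetAvailable(false). After the 30s expiry, closingInvokers.Delete(key) removes the node from the map, but IsAvailable() is still false, so the load balancer skips that node forever, producing a permanently unavailable invoker.

Suggestion: call bi.SetAvailable(true) right after closingInvokers.Delete(key) to restore the state; or simply do not call SetAvailable(false) at all and rely only on the closingInvokers expiry mechanism to control routing.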
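
The first option (restore availability when the entry expires) could be sketched as follows. fakeInvoker, closingEntry, and closingTracker are hypothetical types invented for this example, and the clock is passed in explicitly so the expiry path can be exercised without sleeping.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// fakeInvoker is a hypothetical stand-in for an invoker with an availability flag.
type fakeInvoker struct{ available atomic.Bool }

type closingEntry struct {
	inv      *fakeInvoker
	expireAt time.Time
}

// closingTracker records invokers believed to be shutting down.
type closingTracker struct {
	entries sync.Map // key string -> closingEntry
}

// mark flags the invoker unavailable and remembers when the mark expires.
func (t *closingTracker) mark(key string, inv *fakeInvoker, now time.Time, ttl time.Duration) {
	inv.available.Store(false)
	t.entries.Store(key, closingEntry{inv: inv, expireAt: now.Add(ttl)})
}

// isClosing reports whether key is still marked. On expiry it removes the
// entry and restores availability so the invoker can be routed to again.
func (t *closingTracker) isClosing(key string, now time.Time) bool {
	v, ok := t.entries.Load(key)
	if !ok {
		return false
	}
	e := v.(closingEntry)
	if now.Before(e.expireAt) {
		return true
	}
	t.entries.Delete(key)
	e.inv.available.Store(true)
	return false
}

func main() {
	inv := &fakeInvoker{}
	inv.available.Store(true)
	t := &closingTracker{}
	start := time.Now()
	t.mark("a", inv, start, 30*time.Second)
	fmt.Println(t.isClosing("a", start.Add(time.Second)), inv.available.Load())    // true false
	fmt.Println(t.isClosing("a", start.Add(31*time.Second)), inv.available.Load()) // false true
}
```

Keeping the invoker pointer next to the expiry time is what makes the restore possible; whether restoring is safe when the remote instance is really gone is a separate design question that this sketch does not address.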

Comment on lines +55 to +59
gp.serverLock.Lock()
defer gp.serverLock.Unlock()

for _, server := range gp.serverMap {
    server.GracefulStop()
Contributor

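GracefulStop() is a blocking call that waits for all active RPCs to finish. If it is called while holding serverLock, and an RPC handler that GracefulStop() is waiting on triggers an operation that needs serverLock, such as Export(), it will deadlock. The corresponding implementation in triple.go has the same problem.

Suggestion: copy the server list first, release the lock, then call GracefulStop() on each server: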

gp.serverLock.Lock()
servers := make([]*Server, 0, len(gp.serverMap))
for _, s := range gp.serverMap {
    servers = append(servers, s)
}
gp.serverLock.Unlock()
for _, s := range servers {
    s.GracefulStop()
}


// add closing flag to response
if f.isClosing() {
    result.AddAttachment(constant.GracefulShutdownClosingKey, "true")
Contributor

result may be nil (when the invoker returns a nil result), and calling result.AddAttachment directly will panic.

Suggestion: add a nil check:

if f.isClosing() && result != nil {
    result.AddAttachment(constant.GracefulShutdownClosingKey, "true")
}

Contributor

@Alanxtl Alanxtl left a comment

don't forget to write a sample (integration test) for this feature

Comment on lines +45 to +59
// SetGracefulShutdownCallback sets protocol-level graceful shutdown callback
func SetGracefulShutdownCallback(name string, f GracefulShutdownCallback) {
    gracefulShutdownCallbacks[name] = f
}

// GetGracefulShutdownCallback returns protocol's graceful shutdown callback
func GetGracefulShutdownCallback(name string) (GracefulShutdownCallback, bool) {
    f, ok := gracefulShutdownCallbacks[name]
    return f, ok
}

// GetAllGracefulShutdownCallbacks returns all protocol's graceful shutdown callbacks
func GetAllGracefulShutdownCallbacks() map[string]GracefulShutdownCallback {
    return gracefulShutdownCallbacks
}
Contributor

  1. Getter/setter-style naming is not idiomatic Go; consider renaming to
     func RegisterGracefulShutdownCallback(name string, f GracefulShutdownCallback)
     func LookupGracefulShutdownCallback(name string) (GracefulShutdownCallback, bool)
     func GracefulShutdownCallbacks() map[string]GracefulShutdownCallback
  2. The concurrent access problem needs more thought.
  3. Is repeated registration allowed for SetGracefulShutdownCallback? For example:
     extension.SetGracefulShutdownCallback(GRPC, cb1)
     extension.SetGracefulShutdownCallback(GRPC, cb2)

     If it is not allowed, add a check for it.

  4. Do not return the internal map directly; return a copy if one is needed.
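
If repeated registration should be rejected, one possible shape is sketched below. Returning an error is an assumption of this example (the signature proposed above has no return value); logging and keeping the first callback, or panicking at init time, would be equally valid choices. The callback signature is also assumed.

```go
package main

import (
	"fmt"
	"sync"
)

// GracefulShutdownCallback's signature is an assumption for this sketch.
type GracefulShutdownCallback func() error

var (
	mu        sync.RWMutex
	callbacks = map[string]GracefulShutdownCallback{}
)

// RegisterGracefulShutdownCallback stores f under name and rejects a second
// registration for the same name instead of silently overwriting it.
func RegisterGracefulShutdownCallback(name string, f GracefulShutdownCallback) error {
	mu.Lock()
	defer mu.Unlock()
	if _, exists := callbacks[name]; exists {
		return fmt.Errorf("graceful shutdown callback %q already registered", name)
	}
	callbacks[name] = f
	return nil
}

func main() {
	noop := func() error { return nil }
	fmt.Println(RegisterGracefulShutdownCallback("grpc", noop)) // <nil>
	fmt.Println(RegisterGracefulShutdownCallback("grpc", noop)) // already-registered error
}
```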

Comment on lines -26 to -45
/**
* AddCustomShutdownCallback
* you should not make any assumption about the order.
* For example, if you have more than one callbacks, and you wish the order is:
* callback1()
* callback2()
* ...
* callbackN()
* Then you should put then together:
* func callback() {
* callback1()
* callback2()
* ...
* callbackN()
* }
* I think the order of custom callbacks should be decided by the users.
* Even though I can design a mechanism to support the ordered custom callbacks,
* the benefit of that mechanism is low.
* And it may introduce much complication for another users.
*/
Contributor

Do not delete this comment either


if bi, ok := invoker.(*base.BaseInvoker); ok {
    bi.SetAvailable(false)
    logger.Infof("Graceful shutdown: set invoker unavailable: %s, IsAvailable now=%v",
Contributor

Suggested change
- logger.Infof("Graceful shutdown: set invoker unavailable: %s, IsAvailable now=%v",
+ logger.Infof("Graceful shutdown --- Set invoker unavailable: %s, IsAvailable now=%v",

expireTime := time.Now().Add(f.getClosingInvokerExpireTime())
f.closingInvokers.Store(key, expireTime)

logger.Infof("Graceful shutdown: connection error detected for invoker: %s, marking as closing, will expire at %v, IsAvailable=%v",
Contributor

@Alanxtl Alanxtl Mar 9, 2026

Update the other logger.Infof calls the same way

defaultStepTimeout = 3 * time.Second
defaultConsumerUpdateWaitTime = 3 * time.Second
defaultOfflineRequestWindowTimeout = 3 * time.Second
// retry config
Contributor

Suggested change
// retry config
// retry config

Comment on lines +92 to +95
// SetAvailable sets available flag
func (bi *BaseInvoker) SetAvailable(available bool) {
    bi.available.Store(available)
}
Contributor

There is no need to wrap this in a function; previously the value was always stored directly

Development

Successfully merging this pull request may close these issues.

Support offline gracefully without registry notification.

4 participants