fix: correct misjudged 429 cooldown for OpenAI accounts and add a recovery mechanism for rate-limited accounts #2290
Open
wucm667 wants to merge 2 commits into
sakurawztlt added a commit to sakurawztlt/sub2api that referenced this pull request on May 9, 2026
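Main fixes from PR Wei-Shaw#2290 (Wei-Shaw upstream):
- calculateOpenAI429ResetTime: return nil when neither the 5h nor the 7d window is exhausted (instead of using the max reset as the cooldown), so the caller takes the short fallback. Previously the next 5h/7d reset (hours to days away) was treated as the 429 cooldown, Account.IsSchedulable excluded the account because of RateLimitResetAt, and requests ended in 503.
- parseOpenAIRateLimitResetTime: add clampOpenAIRateLimitExceededReset to clamp resets_at/resets_in_seconds from a rate_limit_exceeded body to maxRateLimit429CooldownSeconds (7200s = 2h in our fork), so that a malicious or mistakenly long upstream reset cannot lock the account.
- recoverOpenAIRateLimitedAccountBeforeNoAvailable: before reporting that no account is available, clear accounts that are rate-limited but whose codex 5h/7d windows are not actually exhausted, then reselect. It is called before the three ErrNoAvailableAccounts returns in selectAccountWithLoadAwareness.

Four items added after the codex 5/9 audit:
1. The fork's fallback default is 300s (defaultRateLimit429CooldownSeconds), not the 5s stated in the PR description; this was already the case ✓, left unchanged.
2. **Add singleflight**: rateLimitRecoveryFlight lives on the OpenAIGatewayService struct, and clearOpenAIRateLimitForRecovery uses singleflight.Do(accountID) so that concurrent clears of the same account do not run into a 429 loop.
3. **Add recover to the 4 no-available exits of the advanced scheduler as well**: openai_account_scheduler.go:604/644/825/890 (selectByLoadBalance with accounts==0 / filtered==0, plus selectByCandidates with selectionOrder==0 / all candidates tried without acquiring one). The PR's main code only covered the default scheduler (selectAccountWithLoadAwareness) and missed the case where the advanced scheduler is enabled.
4. **Manually merge openai_gateway_service_test.go**: the PR's test file conflicted, so the PR's TestOpenAISelectAccountForModelWithExclusions_RecoversNonExhaustedCodexRateLimit (which depends on the recoverableOpenAIAccountRepo stub added by the PR) is skipped. The ListByPlatform / ListByGroup / ClearRateLimit methods are added to stubOpenAIAccountRepo, because otherwise existing tests that reach the recover path panic with a nil pointer.

All sub2api tests pass (including the full handler / pkg / service suites). Prod was not touched.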
Closes #2258
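Background
When an OpenAI Codex account receives a 429 response, the system reads the Codex usage-window information (5h/7d) from the response headers to set the cooldown. However, when neither window is exhausted (used_percent < 100%), the previous logic incorrectly took the larger of the two windows' reset-after-seconds values as the cooldown duration.
In addition, parseOpenAIRateLimitResetTime had no upper bound for rate_limit_exceeded errors. The resets_at value in the response body is entirely controlled by OpenAI, so it could also cause cooldowns of hours or even days.
Changes in this PR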
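1. Fix the cooldown misjudgment (ratelimit_service.go)
- calculateOpenAI429ResetTime: when neither Codex usage window is exhausted, the old fallback branch took the largest reset-after-seconds. It now returns nil directly, so the caller falls through to apply429FallbackRateLimit (a short cooldown, 5 seconds by default, capped at 2h) instead of wrongly using a usage window's reset time.
- parseOpenAIRateLimitResetTime gains clamp protection: the new function clampOpenAIRateLimitExceededReset distinguishes between error types:
  - rate_limit_exceeded (burst rate limiting): the parsed resets_at is capped at maxRateLimit429CooldownSeconds to avoid an overly long cooldown
  - usage_limit_reached (usage genuinely exhausted): the upstream value is kept unchanged
- parseOpenAIRateLimitResetTime is also refactored into parseOpenAIRateLimitResetTimeAt, which accepts an injected time, to improve testability.
2. Add a recovery mechanism for rate-limited accounts (openai_gateway_service.go)
Because the OpenAI platform has no "self-heal on successful response" path (unlike Anthropic), the following recovery logic runs when the service is about to return 503 / no available account:
- recoverOpenAIRateLimitedAccountBeforeNoAvailable: when selectBestAccount returns nil, iterate over the accounts that are blocked only by rate limiting, verify their Codex usage snapshot (neither 5h nor 7d exhausted), clear the rate-limit state, and put them back into scheduling. Safety: if an account is in fact still rate-limited, the upstream returns 429 again and the account is re-marked.
- recoverOpenAIRateLimitedSelectionBeforeNoAvailable: hooks the recovery logic above into all three "no available account" exits of selectAccountWithLoadAwareness.
- openAICodexSnapshotShowsNonExhaustedWindows: uses codex_5h_used_percent, codex_7d_used_percent, and the snapshot time codex_usage_updated_at from the account's Extra field to decide whether the snapshot is valid (the snapshot time and the rate-limit time must differ by no more than 2 minutes), which avoids misjudging from a stale snapshot. openAICodexRecoverySnapshotMaxSkew = 2 * time.Minute.
Test coverage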
New and modified tests (openai_gateway_service_test.go, ratelimit_service_openai_test.go):
- TestOpenAISelectAccountForModelWithExclusions_RecoversNonExhaustedCodexRateLimit: verifies that a rate-limited account whose usage is not exhausted is recovered correctly
- TestOpenAISelectAccountWithLoadAwareness_RecoversNonExhaustedCodexRateLimit: verifies the recovery logic on the load-aware scheduling path
- TestOpenAISelectAccountForModelWithExclusions_DoesNotRecoverExhaustedCodexRateLimit: verifies that recovery is not triggered when 7d usage has reached 100%
- TestCalculateOpenAI429ResetTime_NeitherExhausted_UsesFallback: verifies that nil is returned when neither window is exhausted
- TestParseOpenAIRateLimitResetTime_RateLimitExceededClampsLongReset: verifies that a long rate_limit_exceeded reset time is clamped
- TestParseOpenAIRateLimitResetTime_UsageLimitReachedKeepsLongReset: verifies that a usage_limit_reached reset time is not affected by the clamp
Changed files
- backend/internal/service/openai_gateway_service.go
- backend/internal/service/openai_gateway_service_test.go
- backend/internal/service/ratelimit_service.go
- backend/internal/service/ratelimit_service_openai_test.go