Consider spreading out traffic for metric query requests #2409

Closed
RockysGit opened this issue Dec 30, 2024 · 8 comments

Comments

@RockysGit

Question and Steps to reproduce

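In some cases, after the service restarts, all alert rules start at the same moment. If those rules share the same evaluation interval (for example, 15s), the data source receives a burst of requests every 15s instead of a load spread evenly across the 15s window. Would it be possible to stagger the start times of the cron jobs? Looking at the code, I could not find any logic that handles this.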

Relevant logs and configurations

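When alert rules with the same evaluation interval start at the same moment, the QPS on the data source side peaks once per evaluation interval.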

Version

v8.0.0

@RockysGit
Author

Snipaste_2024-12-30_14-30-34 Adding the monitoring graph for the data source.

@UlricQin
Member

We'll handle this in the next version.

@UlricQin
Member

UlricQin commented Jan 8, 2025

@RockysGit Is beta3 OK now? Could you post another screenshot showing what the load looks like?

@RockysGit
Author

RockysGit commented Jan 8, 2025

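I only pulled the new commits today. beta3 changes a lot of code; I can see the related changes but have not tested them in detail yet. From reading the code, though, there seems to be a problem.
1. Adding only a sleep in the rule start is too hasty: it is not multi-threaded, so it blocks rule.Start() for every rule in the list.
2. My own implementation, offered for reference, takes into account that different rules have different evaluation intervals. I can submit an MR if needed.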

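func (arw *AlertRuleWorker) Start() {
	// Fall back to an "@every Ns" pattern when only PromEvalInterval is set.
	if arw.Rule.CronPattern == "" && arw.Rule.PromEvalInterval != 0 {
		arw.Rule.CronPattern = fmt.Sprintf("@every %ds", arw.Rule.PromEvalInterval)
	}
	// Wait for a random delay in a goroutine, so that rules do not all start
	// at once and Start() never blocks the caller.
	go func() {
		defer func() {
			if r := recover(); r != nil {
				logger.Errorf("eval:%s recovered from panic: %v", arw.Key(), r)
			}
		}()

		select {
		case <-time.After(calcStartTime(arw.Rule.CronPattern)):
			logger.Infof("eval:%s started", arw.Key())
			// Start the scheduler.
			arw.Scheduler.Start()
		case <-arw.Quit:
			logger.Infof("eval:%s stopped", arw.Key())
			return
		}
	}()
}

// calcStartTime returns a random start delay in [0, interval) to keep rules
// that start together from producing unbalanced traffic. The interval is the
// last field of the cron pattern (for example "15s" in "@every 15s"); any
// pattern whose last field is not a duration gets no delay.
func calcStartTime(cron string) time.Duration {
	cronList := strings.Split(cron, " ")
	duration, err := time.ParseDuration(cronList[len(cronList)-1])
	if err != nil {
		logger.Errorf("Failed to parse duration: %v", err)
		return 0
	}
	if duration <= 0 {
		// rand.Int63n panics when its argument is not positive.
		return 0
	}
	// The global math/rand source is seeded automatically since Go 1.20,
	// so there is no need to reseed it on every call.
	delay := time.Duration(rand.Int63n(int64(duration)))
	logger.Infof("Random start delay: %v", delay)
	return delay
}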

@RockysGit
Author

Test result: it does indeed block the start of every rule in the slice. Please fix this as soon as possible.

2025-01-08 19:26:14.839557 INFO eval/eval.go:141 start sleep
·······
2025-01-08 19:26:24.191205 INFO eval/eval.go:144 eval:alert-8-666 started after waiting 9351000000 ms
2025-01-08 19:26:24.285856 INFO eval/eval.go:141 start sleep
········
2025-01-08 19:27:15.407883 INFO eval/eval.go:144 eval:alert-8-839 started after waiting 51121000000 ms

@710leo
Member

710leo commented Jan 8, 2025

@RockysGit This has been fixed in the latest code.

@RockysGit
Author

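Please consider designing the rule start timing more carefully, so that traffic is spread out while the delayed starts (sleeps) of all rules run in parallel and cannot block one another.
The current change only guarantees a 20ms gap between rule starts. If 100 rules start at once, the traffic is still concentrated in the first 2s. In a worse case, starting all the rules could take a long time.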

@RockysGit RockysGit reopened this Jan 9, 2025
@UlricQin
Member

UlricQin commented Jan 11, 2025

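This mechanism is the simplest way to balance the load, and it is more even than a random sleep, which can produce an excessively high QPS at some moment. If we optimize further, the next step would be to make the 20ms configurable so that different companies can tune it. Right now it effectively caps the query QPS against the TSDB at around 500; changing it to 10ms would make that 1000 QPS.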
