WIP.: Systest Postmortem #6222

dsmello · 2024-08-06T17:56:05Z

Motivation

Add a postmortem handler to the systest so that devs have time to inspect the cluster after a failure.

Description

Test Plan

TODO

Explain motivation or link existing issue(s)
Test changes and document test plan
Update documentation as needed
Update changelog as needed


codecov · 2024-08-06T18:13:49Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 82.0%. Comparing base (787351f) to head (3b7c986).
Report is 90 commits behind head on develop.

Additional details and impacted files

@@           Coverage Diff           @@
##           develop   #6222   +/-   ##
=======================================
  Coverage     82.0%   82.0%           
=======================================
  Files          308     308           
  Lines        33906   33906           
=======================================
+ Hits         27810   27812    +2     
+ Misses        4320    4318    -2     
  Partials      1776    1776


fasmat

I see a few issues with this change, and I haven't yet fully understood what its goal is.

fasmat · 2024-08-06T18:02:39Z

systest/postmortenHandler/postmortenhandler.go

+)
+
+/*
+how the post-morten debug handler works:


NIT: I think you mean post-mortem?

fasmat · 2024-08-06T18:03:56Z

systest/postmortenHandler/postmortenhandler.go

+type PostMortenDebugHandler struct {
+	Namespace         string
+	isBeingDebugged   bool
+	expriationTime    int64


NIT:

Suggested change

expriationTime int64

expirationTime int64

fasmat · 2024-08-06T18:05:38Z

systest/postmortenHandler/postmortenhandler.go

+	PosTMortenDebug = parameters.Bool(
+		"post-morten-debug", "if true post-morten debug is allowed",
+	)


NIT: also for the other parameters

Suggested change

PosTMortenDebug = parameters.Bool(
	"post-morten-debug", "if true post-morten debug is allowed",
)

PostMortemDebug = parameters.Bool(
	"post-mortem-debug", "if true post-mortem debug is allowed",
)

fasmat · 2024-08-06T18:05:59Z

systest/postmortenHandler/postmortenhandler.go

+	RegistrationLock sync.Mutex
+)
+
+// LazyFunctionActor - Because we cannot garantee if the cleanup function executed before the server started.


NIT: typo

Suggested change

// LazyFunctionActor - Because we cannot garantee if the cleanup function executed before the server started.

// LazyFunctionActor - Because we cannot guarantee if the cleanup function executed before the server started.

fasmat · 2024-08-06T18:14:14Z

systest/postmortenHandler/postmortenhandler.go

+		for ns, handler := range namespaces {
+			if time.Now().Unix() > handler.expriationTime {
+				Log.Infof("namespace %s expired", ns)
+				wg.Add(1)
+				go func() {
+					defer wg.Done()
+					for f := range handler.cleaningFunctions {
+						f()
+					}
+				}()
+				delete(namespaces, ns)
+			}
+
+		}


This seems incorrect to me - if the first handler has an expiration time of 1 hour and the second of 10 seconds, the second isn't cleaned up until the first expires.

Also since namespaces is a map this is random, i.e. in that case sometimes the first namespace is cleaned up immediately and the second after one hour and sometimes both after one hour.
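
One way to make expiry independent of map iteration order is to give each namespace its own timer. The following is only a sketch under that assumption and is not taken from the PR; `expiryRegistry`, `Register`, `Wait`, and `expiryOrder` are hypothetical names introduced for illustration.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// expiryRegistry is a hypothetical stand-in for the PR's global namespaces map.
// Each namespace gets its own timer, so one expiry never waits on another.
type expiryRegistry struct {
	mu     sync.Mutex
	timers map[string]*time.Timer
	wg     sync.WaitGroup
}

func newExpiryRegistry() *expiryRegistry {
	return &expiryRegistry{timers: make(map[string]*time.Timer)}
}

// Register schedules cleanup for ns after ttl has elapsed.
func (r *expiryRegistry) Register(ns string, ttl time.Duration, cleanup func()) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.wg.Add(1)
	r.timers[ns] = time.AfterFunc(ttl, func() {
		defer r.wg.Done()
		cleanup()
		r.mu.Lock()
		delete(r.timers, ns)
		r.mu.Unlock()
	})
}

// Wait blocks until every registered cleanup has run.
func (r *expiryRegistry) Wait() { r.wg.Wait() }

// expiryOrder registers a long-lived and a short-lived namespace and
// returns the order in which their cleanups ran.
func expiryOrder() []string {
	var (
		mu    sync.Mutex
		order []string
	)
	record := func(ns string) func() {
		return func() {
			mu.Lock()
			defer mu.Unlock()
			order = append(order, ns)
		}
	}
	r := newExpiryRegistry()
	r.Register("long", 200*time.Millisecond, record("long"))
	r.Register("short", 10*time.Millisecond, record("short"))
	r.Wait()
	return order
}

func main() {
	fmt.Println(expiryOrder())
}
```

In this sketch the short-lived namespace is cleaned up first regardless of registration order, because no loop over a map decides when a handler is inspected.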

fasmat · 2024-08-06T18:16:35Z

systest/postmortenHandler/postmortenhandler.go

+	if f != nil {
+		cleanupFunctions <- f
+	}


This will block indefinitely, because there is no second goroutine that drains the channel while it is being written to here.
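
To illustrate the blocking behaviour, here is a small sketch. It assumes `cleanupFunctions` is an unbuffered channel (or one whose buffer is already full), which the snippet above does not show; `sendWouldBlock` and `drainAndRun` are hypothetical helpers, not code from the PR.

```go
package main

import "fmt"

// sendWouldBlock reports whether a send on ch would block right now.
// It uses a non-blocking select so the probe itself never hangs.
func sendWouldBlock(ch chan func()) bool {
	select {
	case ch <- func() {}:
		return false
	default:
		return true
	}
}

// drainAndRun starts a receiving goroutine before sending, so every send
// has a counterpart. It returns how many functions were executed.
func drainAndRun(fs []func()) int {
	ch := make(chan func())
	done := make(chan int)
	go func() {
		n := 0
		for f := range ch {
			f()
			n++
		}
		done <- n
	}()
	for _, f := range fs {
		ch <- f
	}
	close(ch)
	return <-done
}

func main() {
	fmt.Println(sendWouldBlock(make(chan func())))    // no receiver: the send would block
	fmt.Println(sendWouldBlock(make(chan func(), 1))) // free buffer slot: the send succeeds
	fmt.Println(drainAndRun([]func(){func() {}, func() {}}))
}
```

A buffered channel only postpones the problem until the buffer fills, so a dedicated receiver (or a plain slice guarded by a mutex) is probably the more robust choice.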

fasmat · 2024-08-06T18:18:55Z

systest/postmortenHandler/postmortenhandler.go

+	createTime        int64
+	maximumDuration   int64
+	cleaningFunctions chan func()
+	param             *parameters.Parameters


This field is unused

fasmat · 2024-08-06T18:19:14Z

systest/postmortenHandler/postmortenhandler.go

+	Starter          sync.Once
+	server           *http.Server
+	Log              *zap.SugaredLogger
+	namespaces       = map[string]*PostMortenDebugHandler{}
+	lazyRegistration = []map[string]func(){}


Do these all need to be globals?

fasmat · 2024-08-06T18:19:28Z

systest/postmortenHandler/postmortenhandler.go

+	// RegistrationLock
+	RegistrationLock sync.Mutex


Does this need to be a global variable?

fasmat · 2024-08-06T18:21:05Z

systest/postmortenHandler/postmortenhandler.go

+	defer RegistrationLock.Unlock()
+
+	if server == nil {
+		lazyRegistration = append(lazyRegistration, map[string]func(){namespace: f})


Why is lazyRegistration a []map[string]func() and not a map[string][]func()?
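
For comparison, a sketch of the `map[string][]func()` shape: all functions for a namespace sit under one key, so lookup and removal need no scan over a slice of single-entry maps. `lazyRegistry`, `add`, and `run` are hypothetical names used only for this illustration.

```go
package main

import "fmt"

// lazyRegistry groups pending cleanup functions by namespace.
type lazyRegistry map[string][]func()

// add appends f to the functions registered for ns.
func (r lazyRegistry) add(ns string, f func()) {
	r[ns] = append(r[ns], f)
}

// run executes every function registered for ns in registration order,
// removes the entry, and returns how many functions ran.
func (r lazyRegistry) run(ns string) int {
	fs := r[ns]
	for _, f := range fs {
		f()
	}
	delete(r, ns)
	return len(fs)
}

func main() {
	r := lazyRegistry{}
	r.add("ns-a", func() { fmt.Println("cleanup 1 for ns-a") })
	r.add("ns-a", func() { fmt.Println("cleanup 2 for ns-a") })
	r.add("ns-b", func() { fmt.Println("cleanup for ns-b") })
	fmt.Println(r.run("ns-a"), len(r))
}
```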

fasmat

What is the reason to have namespaces and different cleanup functions for them?

As far as I understand this change, it starts an http.Server (as part of the test runner) that shuts down after the namespace with the longest expiration time has expired and executes func()s for the expired namespaces.

The server will not start in the namespace defined - so I assume the namespace still gets deleted immediately.

If the goal is to have something running on the cluster so that the namespace isn't deleted, why not deploy an nginx pod to that namespace when "post-mortem" is active?

fasmat · 2024-08-06T18:26:53Z

systest/postmortenHandler/postmortenhandler.go

+	time.Sleep(1 * time.Second)
+	wg := sync.WaitGroup{}
+
+	for len(namespaces) > 0 {


I don't understand the need for this for loop? In the inner for loop we iterate over all namespaces and delete them one by one, so this outer for loop is only ever executed once?

fasmat · 2024-08-06T18:49:01Z

This is the code that deletes the namespace after a test:

go-spacemesh/systest/testcontext/context.go

Lines 369 to 377 in 787351f

    
	if !cctx.Keep {
		cleanup(t, func() {
			if err := deleteNamespace(cctx); err != nil {
				cctx.Log.Errorw("cleanup failed", "error", err)
				return
			}
			cctx.Log.Infow("namespace was deleted", "namespace", cctx.Namespace)
		})
	}

So if you want the test's namespace to stick around, you can either ensure that cctx.Keep is true by passing the keep flag to the test runner or, if you only want that to happen when a test fails, adjust the cleanup function to execute the namespace deletion only when the test passes:

func cleanup(tb testing.TB, f func()) {
	tb.Cleanup(func() {
		if !tb.Failed() { // only execute f() when test passed
			f()
		}
	})
	signals := make(chan os.Signal, 1)
	signal.Notify(signals, syscall.SIGINT, syscall.SIGTERM)
	go func() {
		<-signals
		f()
		os.Exit(1)
	}()
}
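
The effect of the `tb.Failed()` guard can be checked in isolation with a small sketch. `fakeTB`, `cleanupOnPass`, and `namespaceDeleted` are hypothetical names; the fake embeds `testing.TB` only to satisfy the interface and overrides the two methods the guard uses. The signal handling from the snippet above is left out here.

```go
package main

import (
	"fmt"
	"testing"
)

// fakeTB is a hypothetical stand-in for *testing.T. Embedding the interface
// satisfies testing.TB; only Cleanup and Failed are ever called on it.
type fakeTB struct {
	testing.TB
	failed   bool
	cleanups []func()
}

func (f *fakeTB) Cleanup(fn func()) { f.cleanups = append(f.cleanups, fn) }
func (f *fakeTB) Failed() bool      { return f.failed }

// finish runs the registered cleanups in last-in, first-out order,
// mirroring what the testing package does when a test ends.
func (f *fakeTB) finish() {
	for i := len(f.cleanups) - 1; i >= 0; i-- {
		f.cleanups[i]()
	}
}

// cleanupOnPass registers fn so that it only runs when the test passed.
func cleanupOnPass(tb testing.TB, fn func()) {
	tb.Cleanup(func() {
		if !tb.Failed() {
			fn()
		}
	})
}

// namespaceDeleted reports whether the deletion callback ran
// for a test with the given outcome.
func namespaceDeleted(failed bool) bool {
	deleted := false
	tb := &fakeTB{failed: failed}
	cleanupOnPass(tb, func() { deleted = true })
	tb.finish()
	return deleted
}

func main() {
	fmt.Println(namespaceDeleted(false)) // passing test: namespace is deleted
	fmt.Println(namespaceDeleted(true))  // failing test: namespace is kept
}
```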

acud · 2024-08-07T20:21:02Z

@dsmello I'm not sure this is the way we wanna go. It adds a lot of complexity to the way we run the tests and this will make our lives more difficult in the future if we'd like to refactor this stuff away, since we will be relying on the functionality that it provides.

@fasmat has some other ideas on how to execute on the idea and provide the same functionality. Maybe we could all jump on a call these days and discuss this stuff?

Add postmorten handler on the systest to allow the devs to have time …

ea55a82

…to inspeact the cluster after a fail.

dsmello requested review from dshulyak, fasmat, poszu, ivan4th and acud as code owners August 6, 2024 17:56

dsmello self-assigned this Aug 6, 2024

dsmello removed request for ivan4th, dshulyak, poszu and fasmat August 6, 2024 17:56

Merge remote-tracking branch 'origin/develop' into systest/enable-deb…

3b7c986

…ug-fail

fasmat requested changes Aug 6, 2024


fasmat reviewed Aug 6, 2024


lrettig changed the title ~~WIP.: Systest Postmorten~~ WIP.: Systest Postmortem Sep 14, 2024

fasmat closed this Sep 27, 2024

fasmat deleted the systest/enable-debug-fail branch September 27, 2024 12:30
