Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP.: Systest Postmortem #6222

Closed
wants to merge 2 commits into from
Closed

WIP.: Systest Postmortem #6222

wants to merge 2 commits into from

Conversation

dsmello
Copy link
Contributor

@dsmello dsmello commented Aug 6, 2024

Motivation

Add postmortem handler on the systest to allow the devs to have time to inspeact the cluster after a fail.

Description

Test Plan

TODO

  • Explain motivation or link existing issue(s)
  • Test changes and document test plan
  • Update documentation as needed
  • Update changelog as needed

@dsmello dsmello self-assigned this Aug 6, 2024
Copy link

codecov bot commented Aug 6, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 82.0%. Comparing base (787351f) to head (3b7c986).
Report is 90 commits behind head on develop.

Additional details and impacted files
@@           Coverage Diff           @@
##           develop   #6222   +/-   ##
=======================================
  Coverage     82.0%   82.0%           
=======================================
  Files          308     308           
  Lines        33906   33906           
=======================================
+ Hits         27810   27812    +2     
+ Misses        4320    4318    -2     
  Partials      1776    1776           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Member

@fasmat fasmat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see a few issues with this change, also I haven't yet fully understood what the goal of this change is?

)

/*
how the post-morten debug handler works:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT: I think you mean post-mortem?

type PostMortenDebugHandler struct {
Namespace string
isBeingDebugged bool
expriationTime int64
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT:

Suggested change
expriationTime int64
expirationTime int64

Comment on lines +42 to +44
PosTMortenDebug = parameters.Bool(
"post-morten-debug", "if true post-morten debug is allowed",
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT: also for the other parameters

Suggested change
PosTMortenDebug = parameters.Bool(
"post-morten-debug", "if true post-morten debug is allowed",
)
PotMortemDebug = parameters.Bool(
"post-mortem-debug", "if true post-mortem debug is allowed",
)

RegistrationLock sync.Mutex
)

// LazyFunctionActor - Because we cannot garantee if the cleanup function executed before the server started.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT: typo

Suggested change
// LazyFunctionActor - Because we cannot garantee if the cleanup function executed before the server started.
// LazyFunctionActor - Because we cannot guarantee if the cleanup function executed before the server started.

Comment on lines +117 to +130
for ns, handler := range namespaces {
if time.Now().Unix() > handler.expriationTime {
Log.Infof("namespace %s expired", ns)
wg.Add(1)
go func() {
defer wg.Done()
for f := range handler.cleaningFunctions {
f()
}
}()
delete(namespaces, ns)
}

}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems incorrect to me - if the first handler has an expiration time of 1 hour and the second of 10 seconds, the second isn't cleaned up until the first expires.

Also since namespaces is a map this is random, i.e. in that case sometimes the first namespace is cleaned up immediately and the second after one hour and sometimes both after one hour.

Comment on lines +104 to +106
if f != nil {
cleanupFunctions <- f
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will block indefinitely, because there is no second go routine that drains the channel while it is being written to here.

createTime int64
maximumDuration int64
cleaningFunctions chan func()
param *parameters.Parameters
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This field is unused

Comment on lines +35 to +39
Starter sync.Once
server *http.Server
Log *zap.SugaredLogger
namespaces = map[string]*PostMortenDebugHandler{}
lazyRegistration = []map[string]func(){}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do these all need to be globals?

Comment on lines +58 to +59
// RegistrationLock
RegistrationLock sync.Mutex
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this needed to be a global variable?

defer RegistrationLock.Unlock()

if server == nil {
lazyRegistration = append(lazyRegistration, map[string]func(){namespace: f})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is lazyRegistration a []map[string]func() and not a map[string][]func()?

Copy link
Member

@fasmat fasmat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the reason to have namespaces and different cleanup functions for them?

As far as i understand this change it starts an http.Server (as part of the testrunner) that shuts down after the namespace with the longest expiration time has expired and executes func()s for the expired namespaces.

The server will not start in the namespace defined - so I assume the namespace still gets deleted immediately.

If the goal is to have something running on the cluster so that the namespace isn't deleted, why not deploy a nginx pod to that namespace when "post-mortem" is active?

time.Sleep(1 * time.Second)
wg := sync.WaitGroup{}

for len(namespaces) > 0 {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand the need for this for loop? In the inner for loop we iterate over all namespaces and delete them one by one, so this outer for loop is only ever executed once?

@fasmat
Copy link
Member

fasmat commented Aug 6, 2024

This is the code that deletes the namespace after a test:

if !cctx.Keep {
cleanup(t, func() {
if err := deleteNamespace(cctx); err != nil {
cctx.Log.Errorw("cleanup failed", "error", err)
return
}
cctx.Log.Infow("namespace was deleted", "namespace", cctx.Namespace)
})
}

So if you want that test to stick around you can either ensure that cctx.Keep is true by passing the keep flag to the test runner, or if you only want that to happen when a test fails adjust the cleanup function to only execute the Namespace deletion when the test passes:

func cleanup(tb testing.TB, f func()) {
	tb.Cleanup(func() {
		if !tb.Failed() { // only execute f() when test passed
			f()
		}
	})
	signals := make(chan os.Signal, 1)
	signal.Notify(signals, syscall.SIGINT, syscall.SIGTERM)
	go func() {
		<-signals
		f()
		os.Exit(1)
	}()
}

@acud
Copy link
Contributor

acud commented Aug 7, 2024

@dsmello I'm not sure this is the way we wanna go. It adds a lot of complexity to the way we run the tests and this will make our lives more difficult in the future if we'd like to refactor this stuff away, since we will be relying on the functionality that it provides.

@fasmat has some other ideas on how to execute on the idea and provide the same functionality. Maybe we could all jump on a call these days and discuss this stuff?

@lrettig lrettig changed the title WIP.: Systest Postmorten WIP.: Systest Postmortem Sep 14, 2024
@fasmat fasmat closed this Sep 27, 2024
@fasmat fasmat deleted the systest/enable-debug-fail branch September 27, 2024 12:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants