-
Notifications
You must be signed in to change notification settings - Fork 214
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
WIP.: Systest Postmortem #6222
WIP.: Systest Postmortem #6222
Conversation
…to inspeact the cluster after a fail.
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## develop #6222 +/- ##
=======================================
Coverage 82.0% 82.0%
=======================================
Files 308 308
Lines 33906 33906
=======================================
+ Hits 27810 27812 +2
+ Misses 4320 4318 -2
Partials 1776 1776 ☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see a few issues with this change, also I haven't yet fully understood what the goal of this change is?
) | ||
|
||
/* | ||
how the post-morten debug handler works: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NIT: I think you mean post-mortem?
type PostMortenDebugHandler struct { | ||
Namespace string | ||
isBeingDebugged bool | ||
expriationTime int64 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NIT:
expriationTime int64 | |
expirationTime int64 |
PosTMortenDebug = parameters.Bool( | ||
"post-morten-debug", "if true post-morten debug is allowed", | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NIT: also for the other parameters
PosTMortenDebug = parameters.Bool( | |
"post-morten-debug", "if true post-morten debug is allowed", | |
) | |
PotMortemDebug = parameters.Bool( | |
"post-mortem-debug", "if true post-mortem debug is allowed", | |
) |
RegistrationLock sync.Mutex | ||
) | ||
|
||
// LazyFunctionActor - Because we cannot garantee if the cleanup function executed before the server started. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NIT: typo
// LazyFunctionActor - Because we cannot garantee if the cleanup function executed before the server started. | |
// LazyFunctionActor - Because we cannot guarantee if the cleanup function executed before the server started. |
for ns, handler := range namespaces { | ||
if time.Now().Unix() > handler.expriationTime { | ||
Log.Infof("namespace %s expired", ns) | ||
wg.Add(1) | ||
go func() { | ||
defer wg.Done() | ||
for f := range handler.cleaningFunctions { | ||
f() | ||
} | ||
}() | ||
delete(namespaces, ns) | ||
} | ||
|
||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems incorrect to me - if the first handler has an expiration time of 1 hour and the second of 10 seconds, the second isn't cleaned up until the first expires.
Also since namespaces
is a map this is random, i.e. in that case sometimes the first namespace is cleaned up immediately and the second after one hour and sometimes both after one hour.
if f != nil { | ||
cleanupFunctions <- f | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will block indefinitely, because there is no second go routine that drains the channel while it is being written to here.
createTime int64 | ||
maximumDuration int64 | ||
cleaningFunctions chan func() | ||
param *parameters.Parameters |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This field is unused
Starter sync.Once | ||
server *http.Server | ||
Log *zap.SugaredLogger | ||
namespaces = map[string]*PostMortenDebugHandler{} | ||
lazyRegistration = []map[string]func(){} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do these all need to be globals?
// RegistrationLock | ||
RegistrationLock sync.Mutex |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this needed to be a global variable?
defer RegistrationLock.Unlock() | ||
|
||
if server == nil { | ||
lazyRegistration = append(lazyRegistration, map[string]func(){namespace: f}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is lazyRegistration a []map[string]func()
and not a map[string][]func()
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the reason to have namespaces and different cleanup functions for them?
As far as i understand this change it starts an http.Server
(as part of the testrunner) that shuts down after the namespace with the longest expiration time has expired and executes func()
s for the expired namespaces.
The server will not start in the namespace defined - so I assume the namespace still gets deleted immediately.
If the goal is to have something running on the cluster so that the namespace isn't deleted, why not deploy a nginx
pod to that namespace when "post-mortem" is active?
time.Sleep(1 * time.Second) | ||
wg := sync.WaitGroup{} | ||
|
||
for len(namespaces) > 0 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand the need for this for loop? In the inner for loop we iterate over all namespaces and delete them one by one, so this outer for loop is only ever executed once?
This is the code that deletes the namespace after a test: go-spacemesh/systest/testcontext/context.go Lines 369 to 377 in 787351f
So if you want that test to stick around you can either ensure that func cleanup(tb testing.TB, f func()) {
tb.Cleanup(func() {
if !tb.Failed() { // only execute f() when test passed
f()
}
})
signals := make(chan os.Signal, 1)
signal.Notify(signals, syscall.SIGINT, syscall.SIGTERM)
go func() {
<-signals
f()
os.Exit(1)
}()
} |
@dsmello I'm not sure this is the way we wanna go. It adds a lot of complexity to the way we run the tests and this will make our lives more difficult in the future if we'd like to refactor this stuff away, since we will be relying on the functionality that it provides. @fasmat has some other ideas on how to execute on the idea and provide the same functionality. Maybe we could all jump on a call these days and discuss this stuff? |
Motivation
Add postmortem handler on the systest to allow the devs to have time to inspeact the cluster after a fail.
Description
Test Plan
TODO