report: include failure_reason for manual kills and marking run dead#2102
report: include failure_reason for manual kills and marking run dead#2102
Conversation
|
Hey @kamoltat, @amathuria, and @djgalloway, my PR is ready—please take a look when you can. |
| # codebase so tooling can pick it up. | ||
| job_info['failure_reason'] = reason | ||
|
|
||
| reporter.report_job(run_name, job_id, job_info=job_info) |
There was a problem hiding this comment.
I think you still need dead=True here.
Otherwise it'll be False:
def report_job(self, run_name, job_id, job_info=None, dead=False)
and once you do that job status is already updated in report_job():
if dead and get_status(job_info) is None:
set_status(job_info, 'dead')
I think we can stick to just updating the failure reason here
There was a problem hiding this comment.
Sure, got it. I’ll make the suggested changes accordingly.
Steps performed->Searched codebase for where jobs are marked dead and where job info is pushed (try_push_job_info, reporter.report_job, kill, report.py). ->Implemented t ->Updated kill.py: kill_job() now pushes kill_run() calls ->Updated package-failure path (task/internal/init.py): push status='dead' with failure_reason=msg. ->Added unit test teuthology/test/test_report_dead_reason.py that mocks ResultsReporter and asserts failure_reason is included when try_mark_run_dead(..., reason=...) is called. After the suggestion->Implemented the change in report.py: call reporter.report_job(..., dead=True) (i.e., include dead=True flag when reporting the job). ->Re-ran the focused unit ->Re-ran the full test suite → all tests passed locally. |
feat(observe): Include failure_reason when jobs are manually killed or marked dead
Why this change?
Debugging and monitoring are difficult when manually killed jobs or runs marked 'dead' only update their status without providing a human-readable explanation. This makes it impossible for UI users or automated scraping tools to distinguish between a timeout, an application failure, or a deliberate operator action.
This change introduces a failure_reason field for these scenarios, which significantly improves observability and provides crucial context for why a job was terminated.
What I changed
This PR ensures a failure_reason is populated and pushed to the results server in three key scenarios:
report.py: The try_mark_run_dead() function now accepts a reason argument. If provided, this reason is injected into the job info as the failure_reason.
kill.py: When a user kills a job or run, the system now pushes a status='dead' update that includes a failure_reason (e.g., "killed by ").
init.py: When package checks fail (which also pushes status='dead'), the specific error message is now included as the failure_reason.
test_report_dead_reason.py: A new unit test is added to mock the reporter and verify that calling try_mark_run_dead(..., reason=...) correctly includes the failure_reason in the final report.
Tests performed (local)
Installed the package in editable mode in a new venv.
Ran the new unit test successfully:
pytest -q test_report_dead_reason.py
1 passed