Implement automated SRE healthchecks after deployment #2568

@cptanalatriste

✅ Checklist

  • I have searched open and closed issues for duplicates.
  • This is a request for a new feature in the Data Safe Haven or an upgrade to an existing feature.
  • The feature is still missing in the latest version.
  • I have read through the documentation.
  • This isn't an open-ended question (open a discussion if it is).

๐Ÿ“ Suggested change

At the moment, checking the health of a deployed TRE requires logging into the SRE and manually running the smoke tests script, or performing a manual validation to cover features outside the scope of the smoke tests. It would be great if this could be automated.

I was envisioning something like this:

dsh sre deploy my-tre --healthcheck

This would perform the health checks right after deploying. Alternatively:

dsh sre healthcheck my-tre

This would run the health checks against an already deployed TRE. The checks could include:

  • If cloud-init successfully finished in all workspaces.
  • If the identity container, or any other container, is up and running.
  • If it's possible to pip install within a workspace.
  • If it's possible to git clone within a workspace.
  • If the DNS server is up and running.
  • If DNS is properly configured in workstations and containers.
  • If internet access is actually restricted in T2/T3.
  • If the DNS Sidecar is running successfully.

And any other scenario worth checking.
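A minimal sketch of how the two proposed entry points could share one set of checks. Everything here is hypothetical: `run_checks`, `CheckResult`, and `main` are illustrative names, and `argparse` is only a stand-in for however the real `dsh` CLI is wired up. A check is assumed to be a plain function that takes the SRE name and raises an exception when something is wrong.

```python
import argparse
from dataclasses import dataclass
from typing import Callable, Iterable

# Assumption: a check receives the SRE name and raises if the SRE is unhealthy.
Check = Callable[[str], None]


@dataclass
class CheckResult:
    name: str
    passed: bool
    detail: str = ""


def run_checks(sre_name: str, checks: Iterable[Check]) -> list[CheckResult]:
    """Run every check, recording failures instead of stopping at the first one."""
    results = []
    for check in checks:
        name = getattr(check, "__name__", repr(check))
        try:
            check(sre_name)
        except Exception as exc:  # one failing check must not hide the others
            results.append(CheckResult(name, False, str(exc)))
        else:
            results.append(CheckResult(name, True))
    return results


def build_parser() -> argparse.ArgumentParser:
    """Stand-in for `dsh sre deploy NAME --healthcheck` and `dsh sre healthcheck NAME`."""
    parser = argparse.ArgumentParser(prog="dsh-sre-sketch")
    commands = parser.add_subparsers(dest="command", required=True)
    deploy = commands.add_parser("deploy")
    deploy.add_argument("name")
    deploy.add_argument("--healthcheck", action="store_true")
    healthcheck = commands.add_parser("healthcheck")
    healthcheck.add_argument("name")
    return parser


def main(argv: list[str], checks: Iterable[Check]) -> int:
    """Return 0 if no checks were requested or all of them passed, 1 otherwise."""
    args = build_parser().parse_args(argv)
    if args.command == "deploy":
        print(f"(deployment of {args.name} would happen here)")
        if not args.healthcheck:
            return 0
    results = run_checks(args.name, checks)
    for result in results:
        status = "PASS" if result.passed else "FAIL"
        print(f"{status} {result.name} {result.detail}".rstrip())
    return 0 if all(result.passed for result in results) else 1
```

Returning a non-zero exit code on any failure would let the same command be reused later as a gate in a CI/CD pipeline.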

🚂 How could this be done?

As a first approach, I would just use Pytest + the Azure Python SDK to perform the checks. Or, as @JimMadge suggested, we could even use Selenium. Ideally, this could be a good starting point for a full CI/CD pipeline in the near future.

Labels: enhancement (New functionality that should be added to the Safe Haven)