Implement automated SRE healthchecks after deployment

## :white_check_mark: Checklist



- [x] I have searched open and closed issues for duplicates.
- [x] This is a request for a new feature in the Data Safe Haven or an upgrade to an existing feature.
- [x] The feature is still missing in the [latest version](https://github.com/alan-turing-institute/data-safe-haven/releases).
- [x] I have read through the [documentation](https://alan-turing-institute.github.io/data-safe-haven/).
- [x] This isn't an open-ended question (open a [discussion](https://github.com/alan-turing-institute/data-safe-haven/discussions) if it is).

## :strawberry: Suggested change

At the moment, checking the health of a deployed TRE requires logging into the SRE and manually running the smoke tests script, or performing a manual validation to cover features outside the scope of the smoke tests. It would be great if this could be automated.

I was envisioning something like this:

```
dsh sre deploy my-tre --healthcheck
```

This would perform the health check right after deploying. Alternatively:

```
dsh sre healthcheck my-tre
```

This would run the health checks against an already deployed TRE. The checks could include:

- If `cloud-init` successfully finished in all workspaces.
- If the `identity` container, or any other container, is up and running.
- If it's possible to `pip install` within a workspace.
- If it's possible to `git clone` within a workspace.
- If the DNS server is up and running.
- If DNS is properly configured in workstations and containers.
- If internet access is actually restricted in T2/T3.
- If the DNS Sidecar is running successfully.

And any other scenario worth checking.
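To make the idea concrete, here is a minimal sketch of how two of the checks above (DNS resolution and restricted internet access) could be modelled as small functions that return a uniform result. Everything in it is hypothetical: `CheckResult`, `check_dns_resolves`, `check_outbound_blocked` and `run_checks` are names made up for illustration, not existing `dsh` code, and the sketch assumes the checks are executed from inside a workspace.

```python
import socket
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class CheckResult:
    """Outcome of a single health check (hypothetical structure)."""
    name: str
    passed: bool
    detail: str = ""


def check_dns_resolves(
    hostname: str,
    resolver: Callable[[str], str] = socket.gethostbyname,
) -> CheckResult:
    """Pass if the hostname resolves to an address."""
    name = f"dns:{hostname}"
    try:
        address = resolver(hostname)
    except OSError as exc:  # socket.gaierror is a subclass of OSError
        return CheckResult(name, False, str(exc))
    return CheckResult(name, True, address)


def check_outbound_blocked(
    host: str,
    port: int = 443,
    timeout: float = 3.0,
    connect: Callable = socket.create_connection,
) -> CheckResult:
    """Pass if a TCP connection to the host fails, i.e. access is restricted."""
    name = f"outbound-blocked:{host}:{port}"
    try:
        conn = connect((host, port), timeout)
    except OSError as exc:  # refused, unreachable or timed out
        return CheckResult(name, True, f"connection failed as expected: {exc}")
    conn.close()
    return CheckResult(name, False, "connection unexpectedly succeeded")


def run_checks(checks: List[Callable[[], CheckResult]]) -> List[CheckResult]:
    """Run every check and collect the results, without stopping on failure."""
    return [check() for check in checks]


def all_passed(results: List[CheckResult]) -> bool:
    return all(result.passed for result in results)
```

The `resolver` and `connect` parameters are only there so the checks can be exercised with fakes; a real run would use the defaults. A `healthcheck` command could then print each `CheckResult` and exit non-zero when `all_passed` is false.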



## :steam_locomotive: How could this be done?



As a first approximation, I would just use Pytest + the Azure Python SDK to perform the checks. Or, as @JimMadge suggested, we could even use Selenium. Ideally, this could be a good start for a full CI/CD pipeline in the near future.
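As a rough illustration of the Pytest route, the `cloud-init` check could look something like the sketch below. It assumes that `cloud-init status` prints a line of the form `status: done` on success. The `run_command` hook is hypothetical: in a real suite it would be backed by the Azure Python SDK (for example, by running the command on the workspace VM), whereas here it is replaced by a fake so the example runs on its own. The `test_` function uses a plain `assert`, so Pytest would collect it as-is.

```python
from typing import Callable


def parse_cloud_init_status(output: str) -> str:
    """Extract the value of the 'status:' line from `cloud-init status` output."""
    for line in output.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "status":
            return value.strip()
    return "unknown"


def cloud_init_finished(run_command: Callable[[str], str]) -> bool:
    """True if cloud-init reports that it has finished successfully."""
    return parse_cloud_init_status(run_command("cloud-init status")) == "done"


def fake_run_command(command: str) -> str:
    """Stand-in for a remote command runner (hypothetical; no Azure calls)."""
    return "status: done\n"


def test_cloud_init_finished() -> None:
    # In a real suite, fake_run_command would be replaced by a fixture
    # that executes the command on each workspace.
    assert cloud_init_finished(fake_run_command)
```

Keeping the parsing separate from the command runner means the same test body could be reused for every workspace, with only the runner changing.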

Implement automated SRE healthchecks after deployment #2568
