GitHub Actions Test / Benchmark Run #48

To verify the correctness of PRs, we should have a test / benchmark suite that runs on a variety of hardware / OS configs. GCP seems like the ideal provider here: it offers RTX PRO 6000 GPUs and Windows Server, and we already have systems running on GCP.

While we should eventually extend this to a variety of consumer hardware via Vast, that is out of scope for our CI/CD MVP. We should ensure there's coverage for the following **machines**:
- Windows Server + L4 GPU (G2)
- Linux + L4 GPU (G2)
- Linux + RTX PRO 6000 (G4)

The test / benchmark suite should run whenever commits are pushed to a PR that is marked "ready for review".
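
As a rough illustration of that trigger rule, the check below decides whether a GitHub `pull_request` event should start a run. It relies on the `action` field and the `pull_request.draft` flag from the event payload; the function name `should_run` and the exact set of actions are assumptions, not something this issue prescribes.

```python
# Actions on a pull_request event that should start a run: new commits
# ("synchronize") and the draft -> ready transition ("ready_for_review").
# Including "opened" and "reopened" is an assumption of this sketch.
TRIGGER_ACTIONS = {"opened", "reopened", "synchronize", "ready_for_review"}


def should_run(event: dict) -> bool:
    """Return True if this pull_request event should trigger the suite."""
    pr = event.get("pull_request")
    if pr is None:
        # Not a pull_request event at all.
        return False
    is_draft = bool(pr.get("draft", False))
    return event.get("action") in TRIGGER_ACTIONS and not is_draft
```

Inside a workflow step, the payload can be loaded with `json.load` from the file named by the `GITHUB_EVENT_PATH` environment variable.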

Engine creation parameters should be defined using the signature of `WorldEngine.__init__`. Here are some good starter **WorldEngine Configs**:
```
[
    {"model_uri": "Waypoint-1.5-1B", "quant": null, "model_config_overrides": null},
    {"model_uri": "Waypoint-1.5-1B", "quant": "intw8a8", "model_config_overrides": null},
]
```
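
One possible way to turn the machine list and these configs into a run matrix is sketched below. The machine labels and the `build_matrix` helper are hypothetical; each config dict is assumed to be passed as keyword arguments to `WorldEngine.__init__`, which is not imported here.

```python
import itertools
import json

# Hypothetical labels for the three machines listed above.
MACHINES = ["windows-l4-g2", "linux-l4-g2", "linux-rtx-pro-6000-g4"]

# The starter configs from this issue, kept as JSON so they can live in a file.
CONFIGS_JSON = """
[
    {"model_uri": "Waypoint-1.5-1B", "quant": null, "model_config_overrides": null},
    {"model_uri": "Waypoint-1.5-1B", "quant": "intw8a8", "model_config_overrides": null}
]
"""


def build_matrix(machines, configs):
    """Pair every machine with every engine config (JSON null becomes None)."""
    return [
        {"machine": machine, "engine_kwargs": config}
        for machine, config in itertools.product(machines, configs)
    ]


matrix = build_matrix(MACHINES, json.loads(CONFIGS_JSON))
```

With three machines and two configs this yields six entries, each of which would then be run for both `main` and the PR head.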

## Running

For each (**machine**, **WorldEngine Config**) pair, run the following for both **main** and **HEAD of the PR branch**:
- 1) Performance: Run the benchmarks and use the results to build a table comparing performance across all machines / configs for `main` and the PR. `examples/benchmark.py` can be adapted into a script that calculates LFPS. Use a 256-frame rollout for now. Note: mark any failed runs as failed in the benchmark table rather than excluding them.
- 2) Consistency: Run a forward pass with a fully populated KV cache. You can use `WorldEngine.get_state(...)` and `WorldEngine.load_state(...)` to create a shared state across all runs. Then calculate the MSE between the latent outputs of `main` and the PR. Note: use `torch.use_deterministic_algorithms(True)` for this step only.
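
The reporting side of the two steps above could look roughly like this. It is a dependency-free sketch: `RunResult`, `render_table`, and `mse` are hypothetical names, the table columns are an assumption, and the real latents would be torch tensors rather than flat lists of floats.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Sequence, Tuple


@dataclass
class RunResult:
    machine: str
    config: str            # short label for the WorldEngine config
    ref: str               # "main" or "pr"
    lfps: Optional[float]  # None when the run failed


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two equally sized flattened outputs."""
    if len(a) != len(b) or not a:
        raise ValueError("inputs must be non-empty and the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def render_table(results: List[RunResult]) -> str:
    """Build a markdown table with one row per (machine, config) pair."""
    rows: Dict[Tuple[str, str], Dict[str, Optional[float]]] = {}
    for r in results:
        rows.setdefault((r.machine, r.config), {})[r.ref] = r.lfps

    lines = [
        "| machine | config | main LFPS | PR LFPS |",
        "| --- | --- | --- | --- |",
    ]
    for (machine, config), by_ref in rows.items():
        cells = []
        for ref in ("main", "pr"):
            value = by_ref.get(ref)
            # Failed (or missing) runs stay visible instead of being dropped.
            cells.append("FAILED" if value is None else f"{value:.1f}")
        lines.append(f"| {machine} | {config} | {cells[0]} | {cells[1]} |")
    return "\n".join(lines)
```

Keeping failed runs as explicit `FAILED` cells means a crash on one machine shows up in the comparison rather than silently shrinking the table.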

## Misc

Heuristic: Prefer fewer changes, fewer added files, fewer lines of code.

Per Mithun: "if possible, we should design it so that it's provider-agnostic and that it's easy to add additional tasks, so that we can onboard vast later and/or add more tests if required (e.g. producing samples that we can look at)"
- Caveat: if Mithun's suggestion significantly complicates things or increases scope, it can be skipped for now; otherwise it's preferable.
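
A minimal sketch of what "easy to add additional tasks" might mean in practice, assuming a plain Python entry point that any provider can invoke once the machine is provisioned. The registry, the `task` decorator, and `run_tasks` are hypothetical and not part of the existing codebase.

```python
from typing import Callable, Dict

# Hypothetical registry: task name -> function that takes the engine kwargs
# and returns a dict of metrics. Adding a test means adding one function.
TASKS: Dict[str, Callable[[dict], dict]] = {}


def task(name: str):
    """Decorator that registers a function as a named task."""
    def register(fn: Callable[[dict], dict]) -> Callable[[dict], dict]:
        TASKS[name] = fn
        return fn
    return register


def run_tasks(engine_kwargs: dict) -> Dict[str, dict]:
    """Run every registered task, recording failures instead of raising."""
    results: Dict[str, dict] = {}
    for name, fn in TASKS.items():
        try:
            results[name] = {"ok": True, "metrics": fn(engine_kwargs)}
        except Exception as exc:  # a failed task must still appear in the report
            results[name] = {"ok": False, "error": repr(exc)}
    return results
```

Because nothing here refers to GCP, the same entry point could later be called from a Vast machine, and a sample-producing task would just be one more decorated function.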
