GitHub Actions Test / Benchmark Run #48

@lapp0

To verify the correctness of PRs, we should have a test / benchmark suite that runs on a variety of hardware / OS configs. GCP seems like the ideal provider here: it offers RTX PRO 6000 GPUs and Windows Server, and we already have systems running on GCP.

While we should eventually extend this to a variety of consumer hardware via Vast, that is out of scope for our CI/CD MVP. We should ensure there's coverage for the following machines:

  • Windows Server + L4 GPU (G2)
  • Linux + L4 GPU (G2)
  • Linux + RTX PRO 6000 (G4)

The test / benchmark suite should run whenever commits are pushed to a PR marked "ready for review".

Engine creation parameters should be defined using the signature of WorldEngine.__init__. Here are some good starter WorldEngine configs:

[
    {"model_uri": "Waypoint-1.5-1B", "quant": null, "model_config_overrides": null},
    {"model_uri": "Waypoint-1.5-1B", "quant": "intw8a8", "model_config_overrides": null},
]
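
As a rough illustration of how the machine list and the starter configs could combine into CI jobs, here is a minimal sketch. The short machine ids, the `build_job_matrix` helper, and the job dict layout are assumptions made for this example; only the machine descriptions and the config entries come from this issue.

```python
import itertools
import json

# Machine descriptions come from the list above; the short "id" values are hypothetical.
MACHINES = [
    {"id": "windows-l4", "os": "Windows Server", "gpu": "L4", "family": "G2"},
    {"id": "linux-l4", "os": "Linux", "gpu": "L4", "family": "G2"},
    {"id": "linux-rtx-pro-6000", "os": "Linux", "gpu": "RTX PRO 6000", "family": "G4"},
]

# Starter configs copied from the issue; keys are meant to mirror WorldEngine.__init__.
ENGINE_CONFIGS_JSON = """
[
    {"model_uri": "Waypoint-1.5-1B", "quant": null, "model_config_overrides": null},
    {"model_uri": "Waypoint-1.5-1B", "quant": "intw8a8", "model_config_overrides": null}
]
"""

# Each (machine, config) pair is run for both main and the PR HEAD.
REFS = ["main", "HEAD"]


def build_job_matrix(machines, engine_configs, refs):
    """Return one job dict per (machine, engine config, git ref) combination."""
    return [
        {"machine": m["id"], "engine_kwargs": cfg, "ref": ref}
        for m, cfg, ref in itertools.product(machines, engine_configs, refs)
    ]


ENGINE_CONFIGS = json.loads(ENGINE_CONFIGS_JSON)
JOBS = build_job_matrix(MACHINES, ENGINE_CONFIGS, REFS)
```

With three machines, two configs, and two refs this yields twelve jobs. Keeping the configs as plain JSON means new entries can be added without touching the runner code.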

Running

For each (machine, WorldEngine config) pair, run the following for both main and the HEAD of the PR branch:

    1. Performance: Run the benchmarks. The results should be used to create a table comparing the performance of all machines / configs for main and the PR. examples/benchmark.py can be adapted into a script which calculates LFPS. It should run a 256-frame rollout for now. Note: any failed runs should be marked as such in the benchmark table rather than excluded.
    2. Consistency: Run a forward pass with a fully populated KV cache. You can use WorldEngine.get_state(...) and WorldEngine.load_state(...) to create a shared state across all runs. Then calculate the MSE between the latent output of main and that of this PR. Note: use torch.use_deterministic_algorithms(True) for this step only.
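
The reporting side of the two steps above could look roughly like the sketch below. It is written in plain Python so it stays self-contained. In practice the MSE would more likely be computed on torch tensors, and the function names, row keys, and table columns here are assumptions for illustration, not an agreed format.

```python
def mse(a, b):
    """Mean squared error between two equal-length flat sequences of floats."""
    if len(a) != len(b):
        raise ValueError("latents must have the same number of elements")
    if not a:
        raise ValueError("latents must be non-empty")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def format_cell(value):
    """Render a metric; a failed run (None) is marked instead of being dropped."""
    return "FAILED" if value is None else f"{value:.2f}"


def render_table(rows):
    """Build a markdown table from rows with keys: machine, config, main, pr.

    `main` and `pr` hold the measured metric (e.g. LFPS) or None if the run failed.
    """
    lines = [
        "| Machine | Config | main | PR | Delta |",
        "| --- | --- | --- | --- | --- |",
    ]
    for r in rows:
        # A relative change is only meaningful when both runs succeeded.
        if r["main"] is None or r["pr"] is None or r["main"] == 0:
            delta = "n/a"
        else:
            delta = f"{(r['pr'] - r['main']) / r['main'] * 100:+.1f}%"
        lines.append(
            f"| {r['machine']} | {r['config']} | {format_cell(r['main'])} "
            f"| {format_cell(r['pr'])} | {delta} |"
        )
    return "\n".join(lines)
```

Representing a failed run as `None` keeps its row in the table, which matches the note that failures should be marked rather than excluded.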

Misc

Heuristic: Prefer fewer changes, fewer added files, fewer lines of code.

Per Mithun: "if possible, we should design it so that it's provider-agnostic and that it's easy to add additional tasks, so that we can onboard vast later and/or add more tests if required (e.g. producing samples that we can look at)"

  • Caveat: if Mithun's suggestion significantly complicates things or increases scope, it can be skipped for now; otherwise it's preferable.
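
One lightweight way to keep tasks easy to add without committing to a provider is a small task registry. The sketch below is only an assumption about how that might look: the `task` decorator, `TASKS` dict, `run_all` helper, and result dict layout are all hypothetical, and the task bodies are placeholders rather than real benchmark code.

```python
from typing import Callable, Dict

# Registry mapping a task name to a callable that takes a job dict and returns a result dict.
TASKS: Dict[str, Callable[[dict], dict]] = {}


def task(name):
    """Decorator that registers a function as a CI task under `name`."""
    def register(fn):
        if name in TASKS:
            raise ValueError(f"duplicate task name: {name}")
        TASKS[name] = fn
        return fn
    return register


@task("performance")
def performance(job):
    # Placeholder: a real task would run the benchmark rollout for this job.
    return {"task": "performance", "status": "not_implemented"}


@task("consistency")
def consistency(job):
    # Placeholder: a real task would compare latents between main and the PR.
    return {"task": "consistency", "status": "not_implemented"}


def run_all(job):
    """Run every registered task; record failures instead of aborting the job."""
    results = {}
    for name, fn in TASKS.items():
        try:
            results[name] = fn(job)
        except Exception as exc:  # one failing task should not hide the other results
            results[name] = {"task": name, "status": "failed", "error": repr(exc)}
    return results
```

Because tasks only see a plain job dict, the same script could be invoked on a GCP machine today and on a Vast machine later. Adding a new task, such as sample generation, would then be one more decorated function.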
