NeMo Evaluator is built on four core principles to provide a reliable and versatile evaluation experience:

- **Reproducibility by Default**: All configurations, random seeds, and software provenance are captured automatically for auditable and repeatable evaluations.
- **Scale Anywhere**: Run evaluations from a local machine to a Slurm cluster or cloud-native backends like Lepton AI without changing your workflow.
- **State-of-the-Art Benchmarking**: Access a comprehensive suite of over 100 benchmarks from 18 popular open-source evaluation harnesses. See the full list of [Supported benchmarks and evaluation harnesses](#supported-benchmarks-and-evaluation-harnesses).
- **Extensible and Customizable**: Integrate new evaluation harnesses, add custom benchmarks with proprietary data, and define custom result exporters for existing MLOps tooling.

### How It Works: Launcher and Core Engine

The platform consists of two main components:

- **`nemo-evaluator` ([The Evaluation Core Engine](./docs/nemo-evaluator/index.md))**: A Python library that manages the interaction between an evaluation harness and the model being tested.
- **`nemo-evaluator-launcher` ([The CLI and Orchestration](./docs/nemo-evaluator-launcher/index.md))**: The primary user interface and orchestration layer. It handles configuration, selects the execution environment, and launches the appropriate container to run the evaluation.

Most users typically interact with `nemo-evaluator-launcher`, which serves as a universal gateway to different benchmarks and harnesses. However, it is also possible to interact directly with `nemo-evaluator` by following this [guide](./docs/nemo-evaluator/workflows/using-containers.md).
```mermaid
graph TD
    User([User]) --> Launcher["nemo-evaluator-launcher (CLI and orchestration)"]
    Launcher -->|launches evaluation container| Engine["nemo-evaluator (core engine)"]
    Engine -->|OpenAI-compatible API calls| Model["Model endpoint"]
```

### Quick Start

Get your first evaluation result in minutes. This guide uses your local machine to run a small benchmark against an OpenAI API-compatible endpoint.
#### 1. Install the Launcher
The launcher is the only package required to get started.
```bash
pip install nemo-evaluator-launcher
```
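
To sanity-check the install, you can ask pip for the package metadata; the CLI should also respond to the conventional `--help` flag (assumed here, as with most Python CLIs):

```bash
# Confirm the package is installed and see its version
pip show nemo-evaluator-launcher

# Print the launcher's top-level usage (assumes a standard --help flag)
nemo-evaluator-launcher --help
```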
#### 2. Set Up Your Model Endpoint
NeMo Evaluator works with any model that exposes an OpenAI-compatible endpoint. For this quickstart, we will use a hosted endpoint from [build.nvidia.com](https://build.nvidia.com).

**What is an OpenAI-compatible endpoint?** A server that exposes `/v1/chat/completions` and `/v1/completions` endpoints, matching the OpenAI API specification.
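
For instance, a minimal chat request to such an endpoint might look like this (a sketch: the base URL is build.nvidia.com's published one, and the model ID is illustrative — substitute the model you plan to evaluate):

```bash
# Minimal chat-completions request against an OpenAI-compatible endpoint.
# The model ID below is illustrative; any server speaking this protocol works.
curl https://integrate.api.nvidia.com/v1/chat/completions \
  -H "Authorization: Bearer $NGC_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "nvidia/nvidia-nemotron-nano-9b-v2", "messages": [{"role": "user", "content": "Hello!"}]}'
```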
**Options for model endpoints:**
- **Hosted endpoints** (fastest): Use ready-to-use hosted models from providers like [build.nvidia.com](https://build.nvidia.com) that expose OpenAI-compatible APIs with no hosting required.
- **Self-hosted options**: Host your own models using tools like NVIDIA NIM, vLLM, or TensorRT-LLM for full control over your evaluation environment.

For detailed setup instructions, including self-hosted configurations, see the [tutorial guide](./docs/nemo-evaluator-launcher/tutorial.md).

**Getting an NGC API Key for build.nvidia.com:**
To use out-of-the-box build.nvidia.com APIs, you need an API key:
1. Register an account at [build.nvidia.com](https://build.nvidia.com).
2. In the Setup menu under Keys/Secrets, generate an API key.
3. Set the environment variable by executing `export NGC_API_KEY=<YOUR_API_KEY>`.
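
   For example (a minimal sketch; persisting the key via `~/.bashrc` is illustrative and shell-dependent):

   ```bash
   # Make the key available to the current shell session
   export NGC_API_KEY=<YOUR_API_KEY>

   # Optionally persist it for future sessions (bash shown; adjust for your shell)
   echo 'export NGC_API_KEY=<YOUR_API_KEY>' >> ~/.bashrc
   ```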
#### 3. Run Your First Evaluation
Run a small evaluation on your local machine. The launcher automatically pulls the correct container and executes the benchmark. The benchmarks to run are configured directly in the YAML file.

**Configuration Examples**: Explore ready-to-use configuration files in [`packages/nemo-evaluator-launcher/examples/`](./packages/nemo-evaluator-launcher/examples/) for local, Lepton, and Slurm deployments with various model hosting options (vLLM, NIM, hosted endpoints).

Once you have the example configuration file, either by cloning this repository or downloading one directly such as `local_nvidia_nemotron_nano_9b_v2.yaml`, you can launch the evaluation.
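
One way to fetch just the example file without cloning is via its raw URL (a sketch; the URL is inferred from the repository layout shown above and may change):

```bash
# Download the example config directly from the repository (URL assumed)
curl -LO https://raw.githubusercontent.com/NVIDIA-NeMo/Eval/main/packages/nemo-evaluator-launcher/examples/local_nvidia_nemotron_nano_9b_v2.yaml
```

With the configuration file in place, run: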
```bash
nemo-evaluator-launcher run --config-dir packages/nemo-evaluator-launcher/examples --config-name local_nvidia_nemotron_nano_9b_v2 --override execution.output_dir=<YOUR_OUTPUT_LOCAL_DIR>
```
After running this command, you will see a `job_id`, which can be used to track the job and its results. All logs will be available in your `<YOUR_OUTPUT_LOCAL_DIR>`.
#### 4. Check Your Results
Results, logs, and run configurations are saved locally. Inspect the status of the evaluation job by using the corresponding `job_id`:
```bash
nemo-evaluator-launcher status <job_id_or_invocation_id>
```
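
Once the job completes, the artifacts are plain files on disk (a sketch; the exact directory layout depends on your configuration):

```bash
# Browse the results, logs, and run configuration saved by the launcher
ls -R <YOUR_OUTPUT_LOCAL_DIR>
```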
#### Next Steps
- List all supported benchmarks:

  ```bash
  nemo-evaluator-launcher ls tasks
  ```
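
  To narrow the output, ordinary shell filtering works (a sketch; the task-list format may differ across versions):

  ```bash
  # Filter the task list for a specific benchmark name
  nemo-evaluator-launcher ls tasks | grep -i chartqa
  ```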
- Explore the [Supported Benchmarks](#supported-benchmarks-and-evaluation-harnesses) to see all available harnesses and benchmarks.
- Scale up your evaluations using the [Slurm Executor](./docs/nemo-evaluator-launcher/executors/slurm.md) or [Lepton Executor](./docs/nemo-evaluator-launcher/executors/lepton.md).
- Learn to evaluate self-hosted models in the extended [Tutorial guide](./docs/nemo-evaluator-launcher/tutorial.md) for nemo-evaluator-launcher.
- Customize your workflow with [Custom Exporters](./docs/nemo-evaluator-launcher/exporters/overview.md) or by evaluating with [proprietary data](./docs/nemo-evaluator/extending/framework-definition-file.md).
### Supported Benchmarks and Evaluation Harnesses
NeMo Evaluator Launcher provides pre-built evaluation containers for different evaluation harnesses through the NVIDIA NGC catalog. Each harness supports a variety of benchmarks, which can then be called via `nemo-evaluator`. This table provides a list of benchmark names per harness. A more detailed list of task names can be found in the [list of NGC containers](./docs/nemo-evaluator/index.md#ngc-containers).

| Harness | Description | Container | Container Tag | Example Benchmarks |
|---|---|---|---|---|
|**vlmevalkit**| Vision-language model evaluation |[Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/vlmevalkit)|`25.08.1`| AI2D, ChartQA, OCRBench, SlideVQA |
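
Each container can also be pulled and inspected directly (a sketch; the registry path is inferred from the NGC catalog URL convention and may differ):

```bash
# Pull the vlmevalkit evaluation container from NGC (path assumed from the catalog link above)
docker pull nvcr.io/nvidia/eval-factory/vlmevalkit:25.08.1
```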
### Contribution Guide
We welcome community contributions. Please see our [Contribution Guide](https://github.com/NVIDIA-NeMo/Eval/blob/main/CONTRIBUTING.md) for instructions on submitting pull requests, reporting issues, and suggesting features.