- **Models**: Choose any endpoint from NVIDIA Build's extensive catalog
- **API Key**: Get it from [build.nvidia.com](https://build.nvidia.com/meta/llama-3_1-8b-instruct). See [Setting up API Keys](https://docs.omniverse.nvidia.com/guide-sdg/latest/setup.html#preview-and-set-up-an-api-key).

Make sure to export the API key:

```bash
export NGC_API_KEY=nvapi-...
```
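
To make sure the key works, you can send a quick test request to the chat endpoint. This is a minimal sanity check; the prompt and `max_tokens` value below are arbitrary:

```bash
curl https://integrate.api.nvidia.com/v1/chat/completions \
  -H "Authorization: Bearer $NGC_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta/llama-3.1-8b-instruct",
        "messages": [{"role": "user", "content": "Say hello."}],
        "max_tokens": 16
      }'
```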

#### Option II: Another Hosted Endpoint

- **URL**: Your model's endpoint URL
- **Models**: Any OpenAI-compatible endpoint
- **API Key**: If your endpoint is gated, get an API key from your provider and export it:

```bash
export API_KEY=...
```

#### Option III: Deploy Your Own Endpoint

Deploy an OpenAI-compatible endpoint using frameworks like vLLM, SGLang, TRT-LLM, or NIM.

<!-- TODO: uncomment ref once the guide is ready -->
<!-- Refer to {ref}`bring-your-own-endpoint-manual` for deployment guidance -->
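
For example, a minimal single-node vLLM deployment could look like the sketch below. This assumes vLLM is installed; the model name and port are placeholders to adapt to your setup:

```bash
# Serve an OpenAI-compatible endpoint (exposes /v1/chat/completions and /v1/completions)
vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000

# The endpoint URL for your config would then be:
#   http://localhost:8000/v1/chat/completions
```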
:::{note}
For this tutorial, we will use `meta/llama-3.1-8b-instruct` from [build.nvidia.com](https://build.nvidia.com/meta/llama-3_1-8b-instruct). You will need to export your `NGC_API_KEY` to access this endpoint.
:::

### 2. Select Tasks

Choose which benchmarks to evaluate. You can list all available tasks with the following command:

```bash
nemo-evaluator-launcher ls tasks
```

**Important**: Each task has a dedicated endpoint type (e.g., `/v1/chat/completions`, `/v1/completions`). Ensure that your model provides the correct endpoint type for the tasks you want to evaluate. Use our {ref}`deployment-testing-compatibility` guide to verify your endpoint supports the required formats.
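
Each endpoint type also expects a different request body, which gives a quick way to check what your endpoint actually supports. A rough sketch with placeholder values (`$BASE_URL`, `$API_KEY`, and the model name are assumptions to replace with your own):

```bash
# Chat endpoint: takes a list of messages
curl "$BASE_URL/v1/chat/completions" \
  -H "Authorization: Bearer $API_KEY" -H "Content-Type: application/json" \
  -d '{"model": "my-model", "messages": [{"role": "user", "content": "Hi"}]}'

# Completions endpoint: takes a plain prompt
curl "$BASE_URL/v1/completions" \
  -H "Authorization: Bearer $API_KEY" -H "Content-Type: application/json" \
  -d '{"model": "my-model", "prompt": "Hi"}'
```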

:::{note}
For this tutorial we will pick `ifeval` and `humaneval_instruct`, as these are fast and both use the chat endpoint.
:::

### 3. Create a Configuration File

Create a `configs` directory:

```bash
mkdir configs
```

Create a configuration file with a descriptive name (e.g., `configs/local_endpoint.yaml`) and populate it with the following content:

```yaml
defaults:
  - execution: local  # The evaluation will run locally on your machine using Docker
  - deployment: none  # Since we are evaluating an existing endpoint, we don't need to deploy the model

output_dir: results/${target.api_endpoint.model_id}  # Logs and artifacts will be saved here
mode: sequential  # Default: run tasks sequentially. You can also use the mode 'parallel'

target:
  api_endpoint:
    model_id: meta/llama-3.1-8b-instruct  # TODO: update to the model you want to evaluate
    url: https://integrate.api.nvidia.com/v1/chat/completions  # TODO: update to the endpoint you want to evaluate
    api_key_name: NGC_API_KEY  # Name of the env variable that stores the API key with access to build.nvidia.com (or the model of your choice)

# Specify the benchmarks to evaluate
evaluation:
  # Optional: global evaluation overrides - these apply to all benchmarks below
  nemo_evaluator_config:
    config:
      params:
        parallelism: 2
        request_timeout: 1600
  tasks:
    - name: ifeval  # use the default benchmark configuration
    - name: humaneval_instruct
      # Optional: task overrides - here they apply only to the task `humaneval_instruct`
      nemo_evaluator_config:
        config:
          params:
            max_new_tokens: 1024
            temperature: 0.3
```

This configuration will create evaluations for 2 tasks: `ifeval` and `humaneval_instruct`.

You can display the whole configuration and the scripts that will be executed using `--dry-run`:

```bash
nemo-evaluator-launcher run --config-dir configs --config-name local_endpoint --dry-run
```

### 4. Run the Evaluation

Once your configuration file is complete, you can run the evaluations:

```bash
nemo-evaluator-launcher run --config-dir configs --config-name local_endpoint
```

### 5. Run the Same Evaluation for a Different Model (Using CLI Overrides)

You can override the values from your configuration file using CLI overrides:

```bash
export API_KEY=<YOUR MODEL API KEY>
MODEL_NAME=<YOUR_MODEL_NAME>
URL=<YOUR_ENDPOINT_URL> # Note: endpoint URL needs to be FULL (e.g., https://api.example.com/v1/chat/completions)

nemo-evaluator-launcher run --config-dir configs --config-name local_endpoint \
-o target.api_endpoint.model_id=$MODEL_NAME \
-o target.api_endpoint.url=$URL \
-o target.api_endpoint.api_key_name=API_KEY
```

### 6. Check the Job Status and Results

List the runs from the last 2 hours to see the invocation IDs of the two evaluation jobs:

```bash
nemo-evaluator-launcher ls runs --since 2h # list runs from last 2 hours
```

Use the IDs to check the job statuses:

```bash
nemo-evaluator-launcher status <invocation_id1> <invocation_id2> --json
```

When the jobs finish, you can display the results and export them using the available exporters:

```bash
# Check the results
cat results/*/artifacts/results.yml

# Check the running logs
tail -f results/*/*/logs/stdout.log # use the output_dir printed by the run command

# Export metrics and metadata from both runs to json
nemo-evaluator-launcher export <invocation_id1> <invocation_id2> --dest local --format json
cat processed_results.json
```

Refer to {ref}`exporters-overview` for available export options.

## Next Steps

- **{ref}`evaluation-configuration`**: Customize evaluation parameters and prompts
- **{ref}`executors-overview`**: Try Slurm or Lepton for different environments
<!-- TODO: uncomment once ready -->
<!-- - **{ref}`bring-your-own-endpoint-manual`**: Deploy your own endpoints with various frameworks -->
- **{ref}`exporters-overview`**: Send results to W&B, MLFlow, or other platforms