
Commit 027b086

docs(tutorial): local evaluation of existing endpoint (#386)
Signed-off-by: Ewa Dobrowolska <[email protected]>
Signed-off-by: Marta Stepniewska-Dziubinska <[email protected]>
Co-authored-by: Ewa Dobrowolska <[email protected]>
1 parent 5d8e9f8 commit 027b086

2 files changed: +96 -37 lines changed


docs/tutorials/local-evaluation-of-existing-endpoint.md

Lines changed: 95 additions & 36 deletions
@@ -4,39 +4,49 @@ This tutorial shows how to evaluate an existing API endpoint using the Local exe
 
 ## Prerequisites
 
-### Installation
+- Docker
+- Python environment with the NeMo Evaluator Launcher CLI available (install the launcher by following {ref}`gs-install`)
 
-First, install the NeMo Evaluator Launcher. Refer to {ref}`gs-install` for detailed setup instructions.
+## Step-by-Step Guide
 
-### Requirements
+### 1. Select a Model
 
-- Docker
-- Python environment with the NeMo Evaluator Launcher CLI available
+You have the following options:
 
-## Step-by-Step Guide
+#### Option I: Use the NVIDIA Build API
 
-### 1. Select Model
+- **URL**: `https://integrate.api.nvidia.com/v1/chat/completions`
+- **Models**: Choose any endpoint from NVIDIA Build's extensive catalog
+- **API Key**: Get from [build.nvidia.com](https://build.nvidia.com/meta/llama-3_1-8b-instruct). See [Setting up API Keys](https://docs.omniverse.nvidia.com/guide-sdg/latest/setup.html#preview-and-set-up-an-api-key).
+Make sure to export the API key:
 
-You have two options:
+```
+export NGC_API_KEY=nvapi-...
+```
 
-#### Option A: Use NVIDIA Build API or Another Hosted Endpoint
+#### Option II: Another Hosted Endpoint
 
-- **URL**: `https://integrate.api.nvidia.com/v1/chat/completions` (or your hosted endpoint)
-- **Models**: You can select any OpenAI‑compatible endpoint, including those from the extensive catalog on NVIDIA Build
-- **API Key**: Get from [build.nvidia.com](https://build.nvidia.com/meta/llama-3_1-8b-instruct) (or your provider)
-- For NVIDIA APIs, see [Setting up API Keys](https://docs.omniverse.nvidia.com/guide-sdg/latest/setup.html#preview-and-set-up-an-api-key)
+- **URL**: Your model's endpoint URL
+- **Models**: Any OpenAI-compatible endpoint
+- **API_KEY**: If your endpoint is gated, get an API key from your provider and export it:
+
+```
+export API_KEY=...
+```
 
-#### Option B: Deploy Your Own Endpoint
+#### Option III: Deploy Your Own Endpoint
 
-Deploy an OpenAI-compatible endpoint using frameworks like vLLM, SGLang, TRT-LLM, or NIM. Refer to {ref}`bring-your-own-endpoint-manual` for deployment guidance
+Deploy an OpenAI-compatible endpoint using frameworks like vLLM, SGLang, TRT-LLM, or NIM.
+<!-- TODO: uncomment ref once the guide is ready -->
+<!-- Refer to {ref}`bring-your-own-endpoint-manual` for deployment guidance -->
 
 :::{note}
-For this tutorial we will use `meta/llama-3.1-8b-instruct` from [build.nvidia.com](https://build.nvidia.com/meta/llama-3_1-8b-instruct).
+For this tutorial, we will use `meta/llama-3.1-8b-instruct` from [build.nvidia.com](https://build.nvidia.com/meta/llama-3_1-8b-instruct). You will need to export your `NGC_API_KEY` to access this endpoint.
 :::
 
 ### 2. Select Tasks
 
-Choose which benchmarks to evaluate. Available tasks include:
+Choose which benchmarks to evaluate. You can list all available tasks with the following command:
 
 ```bash
 nemo-evaluator-launcher ls tasks
@@ -47,70 +57,119 @@ For a comprehensive list of supported tasks and descriptions, see {ref}`nemo-eva
 **Important**: Each task has a dedicated endpoint type (e.g., `/v1/chat/completions`, `/v1/completions`). Ensure that your model provides the correct endpoint type for the tasks you want to evaluate. Use our {ref}`deployment-testing-compatibility` guide to verify your endpoint supports the required formats.
 
 :::{note}
-For this tutorial we will pick: `ifeval` and `humaneval_instruct` as these are fast, both use the chat endpoint.
+For this tutorial we will pick: `ifeval` and `humaneval_instruct` as these are fast. They both use the chat endpoint.
 :::
 
-### 3. Create Configuration File
+### 3. Create a Configuration File
 
-Create a `configs` directory and your first configuration file:
+Create a `configs` directory:
 
 ```bash
 mkdir configs
 ```
 
-Create a configuration file with a descriptive name (e.g., `configs/local_endpoint.yaml`):
-
-This configuration will create evaluations for 2 tasks: `ifeval` and `humaneval_instruct`. You can display the whole configuration and scripts which will be executed using `--dry-run`
+Create a configuration file with a descriptive name (e.g., `configs/local_endpoint.yaml`)
+and populate it with the following content:
 
 ```yaml
 defaults:
-  - execution: local
-  - deployment: none
+  - execution: local # The evaluation will run locally on your machine using Docker
+  - deployment: none # Since we are evaluating an existing endpoint, we don't need to deploy the model
   - _self_
 
 execution:
-  output_dir: results/${target.api_endpoint.model_id}
+  output_dir: results/${target.api_endpoint.model_id} # Logs and artifacts will be saved here
+  mode: sequential # Default: run tasks sequentially. You can also use the mode 'parallel'
 
 target:
   api_endpoint:
     model_id: meta/llama-3.1-8b-instruct # TODO: update to the model you want to evaluate
    url: https://integrate.api.nvidia.com/v1/chat/completions # TODO: update to the endpoint you want to evaluate
-    api_key_name: NGC_API_KEY # API Key with access to build.nvidia.com or model of your choice
+    api_key_name: NGC_API_KEY # Name of the env variable that stores the API Key with access to build.nvidia.com (or model of your choice)
 
 # specify the benchmarks to evaluate
 evaluation:
-  overrides: # these overrides apply to all tasks; for task-specific overrides, use the `overrides` field
-    config.params.request_timeout: 3600
+  # Optional: Global evaluation overrides - these apply to all benchmarks below
+  nemo_evaluator_config:
+    config:
+      params:
+        parallelism: 2
+        request_timeout: 1600
   tasks:
     - name: ifeval # use the default benchmark configuration
     - name: humaneval_instruct
+      # Optional: Task overrides - here they apply only to the task `humaneval_instruct`
+      nemo_evaluator_config:
+        config:
+          params:
+            max_new_tokens: 1024
+            temperature: 0.3
+```
+
+This configuration will create evaluations for 2 tasks: `ifeval` and `humaneval_instruct`.
+
+You can display the whole configuration and scripts which will be executed using `--dry-run`:
+
 ```
+nemo-evaluator-launcher run --config-dir configs --config-name local_endpoint --dry-run
+```
+
+### 4. Run the Evaluation
 
-### 4. Run Evaluation
+Once your configuration file is complete, you can run the evaluations:
 
 ```bash
-nemo-evaluator-launcher run --config-dir configs --config-name local_endpoint \
-  -o target.api_endpoint.api_key_name=NGC_API_KEY
+nemo-evaluator-launcher run --config-dir configs --config-name local_endpoint
 ```
 
 ### 5. Run the Same Evaluation for a Different Model (Using CLI Overrides)
+You can override the values from your configuration file using CLI overrides:
 
 ```bash
-export NGC_API_KEY=<YOUR MODEL API KEY>
+export API_KEY=<YOUR MODEL API KEY>
 MODEL_NAME=<YOUR_MODEL_NAME>
 URL=<YOUR_ENDPOINT_URL> # Note: endpoint URL needs to be FULL (e.g., https://api.example.com/v1/chat/completions)
 
 nemo-evaluator-launcher run --config-dir configs --config-name local_endpoint \
   -o target.api_endpoint.model_id=$MODEL_NAME \
   -o target.api_endpoint.url=$URL \
-  -o target.api_endpoint.api_key_name=NGC_API_KEY
+  -o target.api_endpoint.api_key_name=API_KEY
+```
+
+### 6. Check the Job Status and Results
+
+List the runs from last 2 hours to see the invocation IDs of the two evaluation jobs:
+
+```bash
+nemo-evaluator-launcher ls runs --since 2h # list runs from last 2 hours
+```
+
+Use the IDs to check the jobs statuses:
+
+```bash
+nemo-evaluator-launcher status <invocation_id1> <invocation_id2> --json
+```
+
+When jobs finish, you can display results and export them using the available exporters:
+
+```bash
+# Check the results
+cat results/*/artifacts/results.yml
+
+# Check the running logs
+tail -f results/*/*/logs/stdout.log # use the output_dir printed by the run command
+
+# Export metrics and metadata from both runs to json
+nemo-evaluator-launcher export <invocation_id1> <invocation_id2> --dest local --format json
+cat processed_results.json
 ```
 
-After launching, you can view logs and job status. When jobs finish, you can display results and export them using the available exporters. Refer to {ref}`exporters-overview` for available export options.
+Refer to {ref}`exporters-overview` for available export options.
 
 ## Next Steps
 
 - **{ref}`evaluation-configuration`**: Customize evaluation parameters and prompts
 - **{ref}`executors-overview`**: Try Slurm or Lepton for different environments
-- **{ref}`bring-your-own-endpoint-manual`**: Deploy your own endpoints with various frameworks
+<!-- TODO: uncoment once ready -->
+<!-- - **{ref}`bring-your-own-endpoint-manual`**: Deploy your own endpoints with various frameworks -->
 - **{ref}`exporters-overview`**: Send results to W&B, MLFlow, or other platforms
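
For readers following along from the diff alone, the new `configs/local_endpoint.yaml` assembled from the added and unchanged lines above would read roughly as follows. This is a sketch reconstructed from the diff; the YAML indentation is inferred from the key nesting and the model/URL values are the tutorial's defaults:

```yaml
# configs/local_endpoint.yaml (reconstructed from the diff above; indentation inferred)
defaults:
  - execution: local      # run locally using Docker
  - deployment: none      # the endpoint already exists, so nothing is deployed
  - _self_

execution:
  output_dir: results/${target.api_endpoint.model_id}  # logs and artifacts land here
  mode: sequential        # run tasks one after another; 'parallel' is also available

target:
  api_endpoint:
    model_id: meta/llama-3.1-8b-instruct                        # model to evaluate
    url: https://integrate.api.nvidia.com/v1/chat/completions   # endpoint to evaluate
    api_key_name: NGC_API_KEY     # name of the env variable holding the API key

evaluation:
  nemo_evaluator_config:  # optional global overrides, applied to every task below
    config:
      params:
        parallelism: 2
        request_timeout: 1600
  tasks:
    - name: ifeval        # default benchmark configuration
    - name: humaneval_instruct
      nemo_evaluator_config:  # optional per-task overrides for humaneval_instruct only
        config:
          params:
            max_new_tokens: 1024
            temperature: 0.3
```

Per the commands added in the same diff, `nemo-evaluator-launcher run --config-dir configs --config-name local_endpoint --dry-run` prints the resolved configuration and generated scripts for such a file without launching anything.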

packages/nemo-evaluator-launcher/src/nemo_evaluator_launcher/resources/mapping.toml

Lines changed: 1 addition & 1 deletion
@@ -226,7 +226,7 @@ required_env_vars = []
 [bigcode-evaluation-harness.tasks.completions.humaneval]
 required_env_vars = []
 
-[bigcode-evaluation-harness.tasks.completions.humaneval_instruct]
+[bigcode-evaluation-harness.tasks.chat.humaneval_instruct]
 
 
 ###############################################################################
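
After this one-line change, the relevant section of `mapping.toml` registers `humaneval_instruct` under the `chat` endpoint type rather than `completions`, matching the tutorial's note that the task uses the chat endpoint. Assembled from the context lines above, the section would read:

```toml
[bigcode-evaluation-harness.tasks.completions.humaneval]
required_env_vars = []

[bigcode-evaluation-harness.tasks.chat.humaneval_instruct]
```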
