8 changes: 3 additions & 5 deletions docs/index.md
@@ -402,12 +402,10 @@ Setup & Installation <troubleshooting/setup-issues/index>
Runtime & Execution <troubleshooting/runtime-issues/index>
::: -->

<!-- :::{toctree}
:::{toctree}
:caption: References
:hidden:

About References <references/index>
Eval Parameters <evaluation/parameters>
Eval Utils <references/evaluation-utils>
API Documentation <apidocs/index.rst>
::: -->
FAQ <references/faq>
:::
1 change: 1 addition & 0 deletions docs/libraries/nemo-evaluator-launcher/cli.md
@@ -61,6 +61,7 @@ nemo-evaluator-launcher run --config-dir packages/nemo-evaluator-launcher/exampl
-o +config.params.limit_samples=10
```

(launcher-cli-dry-run)=
### Dry Run

Preview the full resolved configuration without executing:
4 changes: 4 additions & 0 deletions docs/references/evaluation-utils.md
@@ -1,3 +1,7 @@
---
orphan: true
---

(evaluation-utils-reference)=

# Evaluation Utilities Reference
323 changes: 323 additions & 0 deletions docs/references/faq.md
@@ -0,0 +1,323 @@
# Frequently Asked Questions

## **What benchmarks and harnesses are supported?**

NeMo Evaluator supports hundreds of benchmarks across multiple harnesses, available through curated NGC evaluation containers and the unified Launcher.

Reference: {ref}`eval-benchmarks`

:::{tip}
Discover available tasks with

```bash
nemo-evaluator-launcher ls tasks
```
:::

---

## **How do I set log dir and verbose logging?**

Set these environment variables for logging configuration:

```bash
# Set log level (INFO, DEBUG, WARNING, ERROR, CRITICAL)
export LOG_LEVEL=DEBUG
# or (legacy, still supported)
export NEMO_EVALUATOR_LOG_LEVEL=DEBUG
```
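
For a one-off verbose run, the variable can also be set inline (a usage sketch; `your_config` stands for whichever config you normally run):

```bash
# LOG_LEVEL applies only to this invocation
LOG_LEVEL=DEBUG nemo-evaluator-launcher run --config-name your_config
```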

Reference: {ref}`nemo-evaluator-logging`.

---

## **Can I run distributed or on a scheduler?**

Yes. The Launcher supports multiple executors. For large-scale, parallel runs, the Slurm executor is recommended: it schedules and executes jobs across cluster nodes, enabling many evaluations to run in parallel while preserving reproducibility via containerized benchmarks.

See {ref}`executor-slurm` for details.
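
As a minimal sketch, assuming a config whose `defaults` select the Slurm executor as described in that reference, launching looks the same as a local run; the config name and output path below are illustrative and the output directory would normally sit on a shared filesystem:

```bash
# Run a config that selects `execution: slurm`; override the output path if needed
nemo-evaluator-launcher run --config-name your_slurm_config \
  -o execution.output_dir=/shared/scratch/eval_results
```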

---

## **Can I point Evaluator at my own endpoint?**

Yes. Use the `none` deployment option: no model deployment is performed as part of the evaluation job. Instead, you provide an existing OpenAI-compatible endpoint, and the launcher runs the evaluation tasks against it.

```yaml
target:
  api_endpoint:
    model_id: meta/llama-3.1-8b-instruct # Model identifier (required)
    url: https://your-endpoint.com/v1/chat/completions # Endpoint URL (required)
    api_key_name: API_KEY # Environment variable name (recommended)
```
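
A usage sketch, assuming the block above lives in a config named `your_config`: export the key under the exact name given in `api_key_name`, then run as usual.

```bash
# The variable name must match api_key_name in the config
export API_KEY="<your key>"
nemo-evaluator-launcher run --config-name your_config
```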

Reference: {ref}`deployment-none`.

---

## **Can I test my endpoint for OpenAI compatibility?**

Yes. Use `--dry-run` to preview the full resolved configuration, including your endpoint settings, without executing:

```bash
nemo-evaluator-launcher run \
--config-dir packages/nemo-evaluator-launcher/examples \
--config-name local_llama_3_1_8b_instruct --dry-run
```

Reference: {ref}`launcher-cli-dry-run`.

---

## **Can I store and retrieve per-sample results, not just the summary?**

Yes. Capture full request/response artifacts and retrieve them from the run's artifacts folder.

Enable detailed logging with `nemo_evaluator_config`:

```yaml
evaluation:
  # Request + response logging (example at 1k each)
  nemo_evaluator_config:
    target:
      api_endpoint:
        adapter_config:
          use_request_logging: True
          max_saved_requests: 1000
          use_response_logging: True
          max_saved_responses: 1000
```

These enable the **RequestLoggingInterceptor** and **ResponseLoggingInterceptor** so each prompt/response pair is saved alongside the evaluation job.

Retrieve artifacts after the run:

```bash
nemo-evaluator-launcher export <invocation_id> --dest local --output-dir ./artifacts --copy-logs
```

Look under `./artifacts/` for `results.yml`, reports, logs, and saved request/response files.
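
For a quick look at what was saved, a plain listing of the export directory is enough (the path matches the `--output-dir` used above):

```bash
# Lists everything the exporter copied, including saved request/response files
find ./artifacts -type f | sort
```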

Reference: {ref}`interceptor-request-logging`.

---

## **Where do I find evaluation results?**

After a run completes, copy artifacts locally:

```bash
nemo-evaluator-launcher info <invocation_id> --copy-artifacts ./artifacts
```

Inside `./artifacts/` you'll find the run config, `results.yml` (the main output file), HTML/JSON reports, logs, and, if caching was used, cached request/response files.

The output is structured as follows:

```text
<output_dir>/
├── eval_factory_metrics.json
├── report.html
├── report.json
├── results.yml
├── run_config.yml
└── <task-specific artifacts>/
```

Reference: {ref}`evaluation-output`.

---

## **Can I export a consolidated JSON of scores?**

Yes. The local exporter produces a consolidated JSON, and automatic exporters for MLflow, Weights & Biases, and Google Sheets are also available.

```bash
nemo-evaluator-launcher export <invocation_id> --dest local --format json
```

This creates `processed_results.json` (you can also pass multiple invocation IDs to merge).
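
A sketch of merging several runs into one JSON; the exact multi-ID syntax is an assumption, and the placeholders stand for real invocation IDs:

```bash
nemo-evaluator-launcher export <invocation_id_1> <invocation_id_2> \
  --dest local --format json
```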

**Exporter docs:** Local files, W&B, MLflow, GSheets are listed under **Launcher → Exporters** in the docs.

Reference: {ref}`exporters-overview`.

---

## **What's the difference between Launcher and Core?**

* **Launcher (`nemo-evaluator-launcher`)**: Unified CLI with config/exec backends (local/Slurm/Lepton), container orchestration, and exporters. Best for most users. See {ref}`lib-launcher`.
* **Core (`nemo-evaluator`)**: Direct access to the evaluation engine and adapters—useful for custom programmatic pipelines and advanced interceptor use. See {ref}`lib-core`.

---

## **Can I add a new benchmark?**

Yes. Use a **Framework Definition File (FDF)**—a YAML that declares framework metadata, default commands/params, and one or more evaluation tasks. Minimal flow:

1. Create an FDF with `framework`, `defaults`, and `evaluations` sections.
2. Point the launcher/Core at your FDF and run.
3. (Recommended) Package as a container for reproducibility and shareability. See {ref}`extending-evaluator`.

**Skeleton FDF (excerpt):**

```yaml
framework:
  name: my-custom-eval
  pkg_name: my_custom_eval
defaults:
  command: >-
    my-eval-cli --model {{target.api_endpoint.model_id}}
    --task {{config.params.task}}
    --output {{config.output_dir}}
evaluations:
  - name: my_task_1
    defaults:
      config:
        params:
          task: my_task_1
```

See the "Framework Definition File (FDF)" page for the full example and field reference.

Reference: {ref}`framework-definition-file`.

---

## **Why aren't exporters included in the main wheel?**

Exporters target **external systems** (e.g., W&B, MLflow, Google Sheets). Each of those adds heavy/optional dependencies and auth integrations. To keep the base install lightweight and avoid forcing unused deps on every user, exporters ship as **optional extras**:

```bash
# Only what you need
pip install "nemo-evaluator-launcher[wandb]"
pip install "nemo-evaluator-launcher[mlflow]"
pip install "nemo-evaluator-launcher[gsheets]"

# Or everything
pip install "nemo-evaluator-launcher[all]"
```

**Exporter docs:** Local files, W&B, MLflow, GSheets are listed under {ref}`exporters-overview`.

---

## **How is input configuration managed?**

NeMo Evaluator uses **Hydra** for configuration management, allowing flexible composition, inheritance, and command-line overrides.

Each evaluation is defined by a YAML configuration file that includes four primary sections:

```yaml
defaults:
  - execution: local
  - deployment: none
  - _self_

execution:
  output_dir: results

target:
  api_endpoint:
    model_id: meta/llama-3.1-8b-instruct
    url: https://integrate.api.nvidia.com/v1/chat/completions
    api_key_name: NGC_API_KEY

evaluation:
  - name: gpqa_diamond
  - name: ifeval
```

This structure defines **where to run**, **how to serve the model**, **which model or endpoint to evaluate**, and **what benchmarks to execute**.

You can start from a provided example config or compose your own using Hydra's `defaults` list to combine deployment, execution, and benchmark modules.

Reference: {ref}`configuration-overview`.

---

## **Can I customize or override configuration values?**

Yes. You can override any field in the YAML file directly from the command line using the `-o` flag:

```bash
# Override output directory
nemo-evaluator-launcher run --config-name your_config \
-o execution.output_dir=my_results

# Override multiple fields
nemo-evaluator-launcher run --config-name your_config \
-o target.api_endpoint.url="https://new-endpoint.com/v1/chat/completions" \
-o target.api_endpoint.model_id=openai/gpt-4o
```

Overrides are merged dynamically at runtime—ideal for testing new endpoints, swapping models, or changing output destinations without editing your base config.

:::{tip}
Always start with a dry run to validate your configuration before launching a full evaluation:

```bash
nemo-evaluator-launcher run --config-name your_config --dry-run
```
:::

Reference: {ref}`configuration-overview`.

---

## **How do I choose the right deployment and execution configuration?**

NeMo Evaluator separates **deployment** (how your model is served) from **execution** (where your evaluations are run). These are configured in the `defaults` section of your YAML file:

```yaml
defaults:
  - execution: local # Where to run: local, lepton, or slurm
  - deployment: none # How to serve the model: none, vllm, sglang, nim, trtllm, generic
```

**Deployment Options — How your model is served**

| Option | Description | Best for |
| ----- | ----- | ----- |
| `none` | Uses an existing API endpoint (e.g., NVIDIA API Catalog, OpenAI, Anthropic). No deployment needed. | External APIs or already-deployed services |
| `vllm` | High-performance inference server for LLMs with tensor parallelism and caching. | Fast local/cluster inference, production workloads |
| `sglang` | Lightweight structured generation server optimized for throughput. | Evaluating structured or long-form text generation |
| `nim` | NVIDIA Inference Microservice (NIM) – optimized for enterprise-grade serving with autoscaling and telemetry. | Enterprise, production, and reproducible benchmarks |
| `trtllm` | TensorRT-LLM backend using GPU-optimized kernels. | Lowest latency and highest GPU efficiency |
| `generic` | Use a custom serving stack of your choice. | Custom frameworks or experimental endpoints |

**Execution Platforms — Where evaluations run**

| Platform | Description | Use case |
| ----- | ----- | ----- |
| `local` | Runs Docker-based evaluation locally. | Development, testing, or small-scale benchmarking |
| `lepton` | Runs on NVIDIA Lepton for on-demand GPU execution. | Scalable, production-grade evaluations |
| `slurm` | Uses your HPC cluster's job scheduler. | Research clusters or large batch evaluations |

**Example:**

```yaml
defaults:
  - execution: lepton
  - deployment: vllm
```

This configuration launches the model with **vLLM serving** and runs benchmarks remotely on **Lepton GPUs**.

When in doubt:

* Use `deployment: none` + `execution: local` for your **first run** (quickest setup).
* Use `vllm` or `nim` once you need **scalability and speed**.

Always test first:


```bash
nemo-evaluator-launcher run --config-name your_config --dry-run
```

Reference: {ref}`configuration-overview`.

---