diff --git a/README.md b/README.md index 0b146156..2c673903 100644 --- a/README.md +++ b/README.md @@ -51,7 +51,6 @@ NeMo Evaluator Launcher provides pre-built evaluation containers for different e | **mmath** | Multilingual math reasoning | [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/mmath) | `25.09.1` | EN, ZH, AR, ES, FR, JA, KO, PT, TH, VI | | **mtbench** | Multi-turn conversation evaluation | [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/mtbench) | `25.09.1` | MT-Bench | | **nemo-skills** | Language model benchmarks (science, math, agentic) | [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/nemo_skills) | `25.09.1` | AIME 24 & 25, BFCL_v3, GPQA, HLE, LiveCodeBench, MMLU, MMLU-Pro | -| **mtbench** | Multi-turn conversation evaluation | [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/mtbench) | `25.09.1` | MT-Bench | | **profbench** | Professional domains in Business and Scientific Research | [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/profbench) | `25.09.1` | ProfBench | | **rag_retriever_eval** | RAG system evaluation | [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/rag_retriever_eval) | `25.09.1` | RAG, Retriever | | **safety-harness** | Safety and bias evaluation | [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/safety-harness) | `25.09.1` | Aegis v2, BBQ, WildGuard | diff --git a/docs/_resources/tasks-table.md b/docs/_resources/tasks-table.md new file mode 100644 index 00000000..73fc2254 --- /dev/null +++ b/docs/_resources/tasks-table.md @@ -0,0 +1,116 @@ + +```{list-table} +:header-rows: 1 +:widths: 20 25 15 15 25 + +* - Container + - Description + - NGC Catalog + - Latest Tag + - Key Benchmarks +* - **agentic_eval** + - Agentic AI evaluation framework + - [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/agentic_eval) + - {{ docker_compose_latest }} + - agentic_eval_answer_accuracy, agentic_eval_goal_accuracy_with_reference, agentic_eval_goal_accuracy_without_reference, agentic_eval_topic_adherence, agentic_eval_tool_call_accuracy +* - **bfcl** + - Function calling evaluation + - [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/bfcl) + - {{ docker_compose_latest }} + - bfclv2, bfclv2_ast, bfclv2_ast_prompting, bfclv3, bfclv3_ast, bfclv3_ast_prompting +* - **bigcode-evaluation-harness** + - Code generation evaluation + - [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/bigcode-evaluation-harness) + - {{ docker_compose_latest }} + - humaneval, humaneval_instruct, humanevalplus, mbpp, mbppplus, mbppplus_nemo, multiple-cpp, multiple-cs, multiple-d, multiple-go, multiple-java, multiple-jl, multiple-js, multiple-lua, multiple-php, multiple-pl, multiple-py, multiple-r, multiple-rb, multiple-rkt, multiple-rs, multiple-scala, multiple-sh, multiple-swift, multiple-ts +* - **compute-eval** + - CUDA code evaluation + - [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/compute-eval) + - {{ docker_compose_latest }} + - cccl_problems, combined_problems, cuda_problems +* - **garak** + - Security and robustness testing + - [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/garak) + - {{ docker_compose_latest }} + - garak +* - **genai-perf** + - GenAI performance benchmarking + - 
[NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/genai-perf) + - {{ docker_compose_latest }} + - genai_perf_generation, genai_perf_summarization +* - **helm** + - Holistic evaluation framework + - [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/helm) + - {{ docker_compose_latest }} + - ci_bench, ehr_sql, head_qa, med_dialog_healthcaremagic, med_dialog_icliniq, medbullets, medcalc_bench, medec, medhallu, medi_qa, medication_qa, mtsamples_procedures, mtsamples_replicate, pubmed_qa, race_based_med +* - **hle** + - Academic knowledge and problem solving + - [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/hle) + - {{ docker_compose_latest }} + - hle +* - **ifbench** + - Instruction following evaluation + - [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/ifbench) + - {{ docker_compose_latest }} + - ifbench +* - **livecodebench** + - Live coding evaluation + - [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/livecodebench) + - {{ docker_compose_latest }} + - AA_codegeneration, codeexecution_v2, codeexecution_v2_cot, codegeneration_notfast, codegeneration_release_latest, codegeneration_release_v1, codegeneration_release_v2, codegeneration_release_v3, codegeneration_release_v4, codegeneration_release_v5, codegeneration_release_v6, livecodebench_0724_0125, livecodebench_0824_0225 +* - **lm-evaluation-harness** + - Language model benchmarks + - [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/lm-evaluation-harness) + - {{ docker_compose_latest }} + - adlr_arc_challenge_llama, adlr_gsm8k_fewshot_cot, adlr_humaneval_greedy, adlr_humanevalplus_greedy, adlr_mbpp_sanitized_3shot_greedy, adlr_mbppplus_greedy_sanitized, adlr_minerva_math_nemo, adlr_mmlu_pro_5_shot_base, adlr_race, adlr_truthfulqa_mc2, agieval, arc_challenge, arc_challenge_chat, arc_multilingual, bbh, bbh_instruct, bbq, commonsense_qa, frames_naive, frames_naive_with_links, frames_oracle, global_mmlu, global_mmlu_ar, global_mmlu_bn, global_mmlu_de, global_mmlu_en, global_mmlu_es, global_mmlu_fr, global_mmlu_full, global_mmlu_full_am, global_mmlu_full_ar, global_mmlu_full_bn, global_mmlu_full_cs, global_mmlu_full_de, global_mmlu_full_el, global_mmlu_full_en, global_mmlu_full_es, global_mmlu_full_fa, global_mmlu_full_fil, global_mmlu_full_fr, global_mmlu_full_ha, global_mmlu_full_he, global_mmlu_full_hi, global_mmlu_full_id, global_mmlu_full_ig, global_mmlu_full_it, global_mmlu_full_ja, global_mmlu_full_ko, global_mmlu_full_ky, global_mmlu_full_lt, global_mmlu_full_mg, global_mmlu_full_ms, global_mmlu_full_ne, global_mmlu_full_nl, global_mmlu_full_ny, global_mmlu_full_pl, global_mmlu_full_pt, global_mmlu_full_ro, global_mmlu_full_ru, global_mmlu_full_si, global_mmlu_full_sn, global_mmlu_full_so, global_mmlu_full_sr, global_mmlu_full_sv, global_mmlu_full_sw, global_mmlu_full_te, global_mmlu_full_tr, global_mmlu_full_uk, global_mmlu_full_vi, global_mmlu_full_yo, global_mmlu_full_zh, global_mmlu_hi, global_mmlu_id, global_mmlu_it, global_mmlu_ja, global_mmlu_ko, global_mmlu_pt, global_mmlu_sw, global_mmlu_yo, global_mmlu_zh, gpqa, gpqa_diamond_cot, gpqa_diamond_cot_5_shot, gsm8k, gsm8k_cot_instruct, gsm8k_cot_llama, gsm8k_cot_zeroshot, gsm8k_cot_zeroshot_llama, hellaswag, hellaswag_multilingual, humaneval_instruct, ifeval, m_mmlu_id_str, mbpp_plus, mgsm, mgsm_cot, mmlu, mmlu_cot_0_shot_chat, mmlu_instruct, mmlu_logits, mmlu_pro, mmlu_pro_instruct, mmlu_prox, 
mmlu_prox_de, mmlu_prox_es, mmlu_prox_fr, mmlu_prox_it, mmlu_prox_ja, mmlu_redux, mmlu_redux_instruct, musr, openbookqa, piqa, social_iqa, truthfulqa, wikilingua, winogrande +* - **mmath** + - Multilingual math reasoning + - [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/mmath) + - {{ docker_compose_latest }} + - mmath_ar, mmath_en, mmath_es, mmath_fr, mmath_ja, mmath_ko, mmath_pt, mmath_th, mmath_vi, mmath_zh +* - **mtbench** + - Multi-turn conversation evaluation + - [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/mtbench) + - {{ docker_compose_latest }} + - mtbench, mtbench-cor1 +* - **nemo-skills** + - Language model benchmarks (science, math, agentic) + - [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/nemo_skills) + - {{ docker_compose_latest }} + - ns_aime2024, ns_aime2025, ns_aime2025_ef, ns_bfcl_v3, ns_gpqa, ns_gpqa_ef, ns_hle, ns_livecodebench, ns_mmlu, ns_mmlu_pro +* - **profbench** + - Professional domains in Business and Scientific Research + - [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/profbench) + - {{ docker_compose_latest }} + - report_generation, llm_judge +* - **rag_retriever_eval** + - RAG system evaluation + - [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/rag_retriever_eval) + - {{ docker_compose_latest }} + - RAG, Retriever +* - **safety-harness** + - Safety and bias evaluation + - [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/safety-harness) + - {{ docker_compose_latest }} + - aegis_v2, aegis_v2_ar, aegis_v2_de, aegis_v2_es, aegis_v2_fr, aegis_v2_hi, aegis_v2_ja, aegis_v2_reasoning, aegis_v2_th, aegis_v2_zh-CN, bbq_full, bbq_small, wildguard +* - **scicode** + - Coding for scientific research + - [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/scicode) + - {{ docker_compose_latest }} + - aa_scicode, scicode, scicode_background +* - **simple-evals** + - Basic evaluation tasks + - [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/simple-evals) + - {{ docker_compose_latest }} + - AA_AIME_2024, AA_math_test_500, AIME_2024, AIME_2025, aime_2024_nemo, aime_2025_nemo, gpqa_diamond, gpqa_diamond_aa_v2, gpqa_diamond_aa_v2_llama_4, gpqa_diamond_nemo, gpqa_extended, gpqa_main, humaneval, humanevalplus, math_test_500, math_test_500_nemo, mgsm, mmlu, mmlu_am, mmlu_ar, mmlu_ar-lite, mmlu_bn, mmlu_bn-lite, mmlu_cs, mmlu_de, mmlu_de-lite, mmlu_el, mmlu_en, mmlu_en-lite, mmlu_es, mmlu_es-lite, mmlu_fa, mmlu_fil, mmlu_fr, mmlu_fr-lite, mmlu_ha, mmlu_he, mmlu_hi, mmlu_hi-lite, mmlu_id, mmlu_id-lite, mmlu_ig, mmlu_it, mmlu_it-lite, mmlu_ja, mmlu_ja-lite, mmlu_ko, mmlu_ko-lite, mmlu_ky, mmlu_llama_4, mmlu_lt, mmlu_mg, mmlu_ms, mmlu_my-lite, mmlu_ne, mmlu_nl, mmlu_ny, mmlu_pl, mmlu_pro, mmlu_pro_llama_4, mmlu_pt, mmlu_pt-lite, mmlu_ro, mmlu_ru, mmlu_si, mmlu_sn, mmlu_so, mmlu_sr, mmlu_sv, mmlu_sw, mmlu_sw-lite, mmlu_te, mmlu_tr, mmlu_uk, mmlu_vi, mmlu_yo, mmlu_yo-lite, mmlu_zh-lite, simpleqa +* - **tooltalk** + - Tool usage evaluation + - [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/tooltalk) + - {{ docker_compose_latest }} + - tooltalk +* - **vlmevalkit** + - Vision-language model evaluation + - [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/vlmevalkit) + - {{ docker_compose_latest }} + - ai2d_judge, chartqa, ocrbench, slidevqa +``` diff --git a/docs/about/key-features.md 
b/docs/about/key-features.md index 519f9713..5079bf22 100644 --- a/docs/about/key-features.md +++ b/docs/about/key-features.md @@ -57,115 +57,7 @@ nemo-evaluator-launcher export --dest gsheets ### Container-First Architecture Pre-built NGC containers guarantee reproducible results across environments: -```{list-table} -:header-rows: 1 -:widths: 20 25 15 15 25 - -* - Container - - Description - - NGC Catalog - - Latest Tag - - Key Benchmarks -* - **agentic_eval** - - Agentic AI evaluation framework - - [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/agentic_eval) - - {{ docker_compose_latest }} - - agentic_eval_answer_accuracy, agentic_eval_goal_accuracy_with_reference, agentic_eval_goal_accuracy_without_reference, agentic_eval_topic_adherence, agentic_eval_tool_call_accuracy -* - **bfcl** - - Function calling evaluation - - [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/bfcl) - - {{ docker_compose_latest }} - - bfclv2, bfclv2_ast, bfclv2_ast_prompting, bfclv3, bfclv3_ast, bfclv3_ast_prompting -* - **bigcode-evaluation-harness** - - Code generation evaluation - - [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/bigcode-evaluation-harness) - - {{ docker_compose_latest }} - - humaneval, humaneval_instruct, humanevalplus, mbpp, mbppplus, mbppplus_nemo, multiple-cpp, multiple-cs, multiple-d, multiple-go, multiple-java, multiple-jl, multiple-js, multiple-lua, multiple-php, multiple-pl, multiple-py, multiple-r, multiple-rb, multiple-rkt, multiple-rs, multiple-scala, multiple-sh, multiple-swift, multiple-ts -* - **compute-eval** - - CUDA code evaluation - - [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/compute-eval) - - {{ docker_compose_latest }} - - cccl_problems, combined_problems, cuda_problems -* - **garak** - - Security and robustness testing - - [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/garak) - - {{ docker_compose_latest }} - - garak -* - **genai-perf** - - GenAI performance benchmarking - - [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/genai-perf) - - {{ docker_compose_latest }} - - genai_perf_generation, genai_perf_summarization -* - **helm** - - Holistic evaluation framework - - [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/helm) - - {{ docker_compose_latest }} - - ci_bench, ehr_sql, head_qa, med_dialog_healthcaremagic, med_dialog_icliniq, medbullets, medcalc_bench, medec, medhallu, medi_qa, medication_qa, mtsamples_procedures, mtsamples_replicate, pubmed_qa, race_based_med -* - **hle** - - Academic knowledge and problem solving - - [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/hle) - - {{ docker_compose_latest }} - - hle -* - **ifbench** - - Instruction following evaluation - - [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/ifbench) - - {{ docker_compose_latest }} - - ifbench -* - **livecodebench** - - Live coding evaluation - - [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/livecodebench) - - {{ docker_compose_latest }} - - AA_codegeneration, codeexecution_v2, codeexecution_v2_cot, codegeneration_notfast, codegeneration_release_latest, codegeneration_release_v1, codegeneration_release_v2, codegeneration_release_v3, codegeneration_release_v4, codegeneration_release_v5, codegeneration_release_v6, livecodebench_0724_0125, livecodebench_0824_0225 -* - 
**lm-evaluation-harness** - - Language model benchmarks - - [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/lm-evaluation-harness) - - {{ docker_compose_latest }} - - adlr_arc_challenge_llama, adlr_gsm8k_fewshot_cot, adlr_humaneval_greedy, adlr_humanevalplus_greedy, adlr_mbpp_sanitized_3shot_greedy, adlr_mbppplus_greedy_sanitized, adlr_minerva_math_nemo, adlr_mmlu_pro_5_shot_base, adlr_race, adlr_truthfulqa_mc2, agieval, arc_challenge, arc_challenge_chat, arc_multilingual, bbh, bbh_instruct, bbq, commonsense_qa, frames_naive, frames_naive_with_links, frames_oracle, global_mmlu, global_mmlu_ar, global_mmlu_bn, global_mmlu_de, global_mmlu_en, global_mmlu_es, global_mmlu_fr, global_mmlu_full, global_mmlu_full_am, global_mmlu_full_ar, global_mmlu_full_bn, global_mmlu_full_cs, global_mmlu_full_de, global_mmlu_full_el, global_mmlu_full_en, global_mmlu_full_es, global_mmlu_full_fa, global_mmlu_full_fil, global_mmlu_full_fr, global_mmlu_full_ha, global_mmlu_full_he, global_mmlu_full_hi, global_mmlu_full_id, global_mmlu_full_ig, global_mmlu_full_it, global_mmlu_full_ja, global_mmlu_full_ko, global_mmlu_full_ky, global_mmlu_full_lt, global_mmlu_full_mg, global_mmlu_full_ms, global_mmlu_full_ne, global_mmlu_full_nl, global_mmlu_full_ny, global_mmlu_full_pl, global_mmlu_full_pt, global_mmlu_full_ro, global_mmlu_full_ru, global_mmlu_full_si, global_mmlu_full_sn, global_mmlu_full_so, global_mmlu_full_sr, global_mmlu_full_sv, global_mmlu_full_sw, global_mmlu_full_te, global_mmlu_full_tr, global_mmlu_full_uk, global_mmlu_full_vi, global_mmlu_full_yo, global_mmlu_full_zh, global_mmlu_hi, global_mmlu_id, global_mmlu_it, global_mmlu_ja, global_mmlu_ko, global_mmlu_pt, global_mmlu_sw, global_mmlu_yo, global_mmlu_zh, gpqa, gpqa_diamond_cot, gpqa_diamond_cot_5_shot, gsm8k, gsm8k_cot_instruct, gsm8k_cot_llama, gsm8k_cot_zeroshot, gsm8k_cot_zeroshot_llama, hellaswag, hellaswag_multilingual, humaneval_instruct, ifeval, m_mmlu_id_str, mbpp_plus, mgsm, mgsm_cot, mmlu, mmlu_cot_0_shot_chat, mmlu_instruct, mmlu_logits, mmlu_pro, mmlu_pro_instruct, mmlu_prox, mmlu_prox_de, mmlu_prox_es, mmlu_prox_fr, mmlu_prox_it, mmlu_prox_ja, mmlu_redux, mmlu_redux_instruct, musr, openbookqa, piqa, social_iqa, truthfulqa, wikilingua, winogrande -* - **mmath** - - Multilingual math reasoning - - [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/mmath) - - {{ docker_compose_latest }} - - mmath_ar, mmath_en, mmath_es, mmath_fr, mmath_ja, mmath_ko, mmath_pt, mmath_th, mmath_vi, mmath_zh -* - **mtbench** - - Multi-turn conversation evaluation - - [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/mtbench) - - {{ docker_compose_latest }} - - mtbench, mtbench-cor1 -* - **nemo-skills** - - Language model benchmarks (science, math, agentic) - - [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/nemo_skills) - - {{ docker_compose_latest }} - - ns_aime2024, ns_aime2025, ns_aime2025_ef, ns_bfcl_v3, ns_gpqa, ns_gpqa_ef, ns_hle, ns_livecodebench, ns_mmlu, ns_mmlu_pro -* - **rag_retriever_eval** - - RAG system evaluation - - [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/rag_retriever_eval) - - {{ docker_compose_latest }} - - RAG, Retriever -* - **safety-harness** - - Safety and bias evaluation - - [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/safety-harness) - - {{ docker_compose_latest }} - - aegis_v2, aegis_v2_ar, aegis_v2_de, aegis_v2_es, aegis_v2_fr, aegis_v2_hi, 
aegis_v2_ja, aegis_v2_reasoning, aegis_v2_th, aegis_v2_zh-CN, bbq_full, bbq_small, wildguard -* - **scicode** - - Coding for scientific research - - [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/scicode) - - {{ docker_compose_latest }} - - aa_scicode, scicode, scicode_background -* - **simple-evals** - - Basic evaluation tasks - - [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/simple-evals) - - {{ docker_compose_latest }} - - AA_AIME_2024, AA_math_test_500, AIME_2024, AIME_2025, aime_2024_nemo, aime_2025_nemo, gpqa_diamond, gpqa_diamond_aa_v2, gpqa_diamond_aa_v2_llama_4, gpqa_diamond_nemo, gpqa_extended, gpqa_main, humaneval, humanevalplus, math_test_500, math_test_500_nemo, mgsm, mmlu, mmlu_am, mmlu_ar, mmlu_ar-lite, mmlu_bn, mmlu_bn-lite, mmlu_cs, mmlu_de, mmlu_de-lite, mmlu_el, mmlu_en, mmlu_en-lite, mmlu_es, mmlu_es-lite, mmlu_fa, mmlu_fil, mmlu_fr, mmlu_fr-lite, mmlu_ha, mmlu_he, mmlu_hi, mmlu_hi-lite, mmlu_id, mmlu_id-lite, mmlu_ig, mmlu_it, mmlu_it-lite, mmlu_ja, mmlu_ja-lite, mmlu_ko, mmlu_ko-lite, mmlu_ky, mmlu_llama_4, mmlu_lt, mmlu_mg, mmlu_ms, mmlu_my-lite, mmlu_ne, mmlu_nl, mmlu_ny, mmlu_pl, mmlu_pro, mmlu_pro_llama_4, mmlu_pt, mmlu_pt-lite, mmlu_ro, mmlu_ru, mmlu_si, mmlu_sn, mmlu_so, mmlu_sr, mmlu_sv, mmlu_sw, mmlu_sw-lite, mmlu_te, mmlu_tr, mmlu_uk, mmlu_vi, mmlu_yo, mmlu_yo-lite, mmlu_zh-lite, simpleqa -* - **tooltalk** - - Tool usage evaluation - - [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/tooltalk) - - {{ docker_compose_latest }} - - tooltalk -* - **vlmevalkit** - - Vision-language model evaluation - - [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/vlmevalkit) - - {{ docker_compose_latest }} - - ai2d_judge, chartqa, ocrbench, slidevqa +```{include} ../_resources/tasks-table.md ``` ```bash @@ -302,7 +194,7 @@ NeMo Evaluator supports OpenAI-compatible API endpoints: - **Hosted Models**: NVIDIA Build, OpenAI, Anthropic, Cohere - **Self-Hosted**: vLLM, TRT-LLM, NeMo Framework -- **Custom Endpoints**: Any service implementing OpenAI API spec (test compatibility with our [Testing Endpoint Compatibility](../deployment/bring-your-own-endpoint/testing-endpoint-oai-compatibility.md) guide) +- **Custom Endpoints**: Any service implementing OpenAI API spec (test compatibility with our {ref}`deployment-testing-compatibility` guide) The platform supports the following endpoint types: diff --git a/docs/conf.py b/docs/conf.py index fe8ffaf4..2b9a842b 100644 --- a/docs/conf.py +++ b/docs/conf.py @@ -136,7 +136,7 @@ "support_email": "update-me", "min_python_version": "3.8", "recommended_cuda": "12.0+", - "docker_compose_latest": "25.09", + "docker_compose_latest": "25.09.1", } # Enable figure numbering diff --git a/docs/deployment/bring-your-own-endpoint/index.md b/docs/deployment/bring-your-own-endpoint/index.md index dc7fb7cf..055340e8 100644 --- a/docs/deployment/bring-your-own-endpoint/index.md +++ b/docs/deployment/bring-your-own-endpoint/index.md @@ -86,7 +86,7 @@ Your endpoint must provide OpenAI-compatible APIs: - **Health Check**: `/v1/health` (GET) - For monitoring (recommended) ### Request/Response Format -Must follow OpenAI API specifications for compatibility with evaluation frameworks. See the [Testing Endpoint Compatibility](testing-endpoint-oai-compatibility.md) guide to verify your endpoint's OpenAI compatibility. +Must follow OpenAI API specifications for compatibility with evaluation frameworks. 
See the {ref}`deployment-testing-compatibility` guide to verify your endpoint's OpenAI compatibility. ## Configuration Management diff --git a/docs/deployment/bring-your-own-endpoint/manual-deployment.md b/docs/deployment/bring-your-own-endpoint/manual-deployment.md index 5c1b1c57..111ef435 100644 --- a/docs/deployment/bring-your-own-endpoint/manual-deployment.md +++ b/docs/deployment/bring-your-own-endpoint/manual-deployment.md @@ -27,7 +27,7 @@ This guide focuses on NeMo Evaluator configuration. For specific serving framewo ## Using Manual Deployments with NeMo Evaluator -Before connecting to your manual deployment, verify it's properly configured using our [Testing Endpoint Compatibility](testing-endpoint-oai-compatibility.md) guide. +Before connecting to your manual deployment, verify it's properly configured using our {ref}`deployment-testing-compatibility` guide. ### With Launcher diff --git a/docs/deployment/bring-your-own-endpoint/testing-endpoint-oai-compatibility.md b/docs/deployment/bring-your-own-endpoint/testing-endpoint-oai-compatibility.md index 799cbec9..f4cd305f 100644 --- a/docs/deployment/bring-your-own-endpoint/testing-endpoint-oai-compatibility.md +++ b/docs/deployment/bring-your-own-endpoint/testing-endpoint-oai-compatibility.md @@ -1,3 +1,4 @@ +(deployment-testing-compatibility)= # Testing Endpoint Compatibility This guide helps you test your hosted endpoint to verify OpenAI-compatible API compatibility using `curl` requests for different task types. Models deployed using `nemo-evaluator-launcher` should be compatible with these tests. diff --git a/docs/deployment/index.md b/docs/deployment/index.md index 3a95cdc5..5a30ea45 100644 --- a/docs/deployment/index.md +++ b/docs/deployment/index.md @@ -133,7 +133,7 @@ Choose from these approaches when managing your own deployment: +- **Custom serving**: Any OpenAI-compatible endpoint (verify compatibility with our {ref}`deployment-testing-compatibility` guide) --> ### Hosted Services - **NVIDIA Build**: Ready-to-use hosted models with OpenAI-compatible APIs diff --git a/docs/evaluation/_snippets/commands/list_tasks.sh b/docs/evaluation/_snippets/commands/list_tasks.sh index 31b20964..1930dd7c 100755 --- a/docs/evaluation/_snippets/commands/list_tasks.sh +++ b/docs/evaluation/_snippets/commands/list_tasks.sh @@ -1,14 +1,14 @@ #!/bin/bash -# Task discovery commands for NeMo Evaluator +# Task discovery commands for NeMo Evaluator Launcher # [snippet-start] # List all available benchmarks -nemo-evaluator-launcher ls tasks +nemo-evaluator-launcher ls # Output as JSON for programmatic filtering -nemo-evaluator-launcher ls tasks --json +nemo-evaluator-launcher ls --json -# Filter for specific task types (example: academic benchmarks) -nemo-evaluator-launcher ls tasks | grep -E "(mmlu|gsm8k|arc)" +# Filter for specific task types (example: mmlu, gsm8k, and arc_challenge) +nemo-evaluator-launcher ls | grep -E "(mmlu|gsm8k|arc)" # [snippet-end] diff --git a/docs/evaluation/_snippets/commands/list_tasks_core.sh b/docs/evaluation/_snippets/commands/list_tasks_core.sh new file mode 100644 index 00000000..902bb2b6 --- /dev/null +++ b/docs/evaluation/_snippets/commands/list_tasks_core.sh @@ -0,0 +1,8 @@ +#!/bin/bash +# Task discovery commands for NeMo Evaluator +# FIXME(martas): Hard-code the container version + +# [snippet-start] +# List benchmarks available in the container +docker run --rm nvcr.io/nvidia/eval-factory/simple-evals:25.09.1 nemo-evaluator ls +# [snippet-end] diff --git a/docs/evaluation/benchmarks.md 
b/docs/evaluation/benchmarks.md index c37ac8e7..489a94da 100644 --- a/docs/evaluation/benchmarks.md +++ b/docs/evaluation/benchmarks.md @@ -2,13 +2,9 @@ # Benchmark Catalog -Comprehensive catalog of 100+ benchmarks across 18 evaluation harnesses, all available through NGC containers and the NeMo Evaluator platform. +Comprehensive catalog of hundreds of benchmarks across popular evaluation harnesses, all available through NGC containers and the NeMo Evaluator platform. -## Overview - -NeMo Evaluator provides access to benchmarks across multiple domains through pre-built NGC containers and the unified launcher CLI. Each container specializes in different evaluation domains while maintaining consistent interfaces and reproducible results. - ## Available via Launcher ```{literalinclude} _snippets/commands/list_tasks.sh @@ -17,36 +13,32 @@ NeMo Evaluator provides access to benchmarks across multiple domains through pre :end-before: "# [snippet-end]" ``` +## Available via Direct Container Access + +```{literalinclude} _snippets/commands/list_tasks_core.sh +:language: bash +:start-after: "# [snippet-start]" +:end-before: "# [snippet-end]" +``` + ## Choosing Benchmarks for Academic Research :::{admonition} Benchmark Selection Guide :class: tip -**For Language Understanding & General Knowledge**: -Recommended suite for comprehensive model evaluation: +**For General Knowledge**: - `mmlu_pro` - Expert-level knowledge across 14 domains -- `arc_challenge` - Complex reasoning and science questions -- `hellaswag` - Commonsense reasoning about situations -- `truthfulqa` - Factual accuracy vs. plausibility - -```bash -nemo-evaluator-launcher run \ - --config-dir packages/nemo-evaluator-launcher/examples \ - --config-name local_academic_suite \ - -o 'evaluation.tasks=["mmlu_pro", "arc_challenge", "hellaswag", "truthfulqa"]' -``` +- `gpqa_diamond` - Graduate-level science questions **For Mathematical & Quantitative Reasoning**: -- `gsm8k` - Grade school math word problems -- `math` - Competition-level mathematics +- `AIME_2025` - American Invitational Mathematics Examination (AIME) 2025 questions - `mgsm` - Multilingual math reasoning **For Instruction Following & Alignment**: -- `ifeval` - Precise instruction following -- `gpqa_diamond` - Graduate-level science questions +- `ifbench` - Precise instruction following - `mtbench` - Multi-turn conversation quality -**See benchmark details below** for complete task descriptions and requirements. +See benchmark categories below and {ref}`benchmarks-full-list` for more details. 
::: ## Benchmark Categories ### **Academic & Reasoning** ```{list-table} :header-rows: 1 -:widths: 20 30 30 20 +:widths: 20 30 30 50 * - Container - - Benchmarks - Description - NGC Catalog + - Benchmarks * - **simple-evals** - - MMLU Pro, GSM8K, ARC Challenge - - Core academic benchmarks + - Common evaluation tasks - [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/simple-evals) + - GPQA-D, MATH-500, AIME 24 & 25, HumanEval, MGSM, MMMLU, MMLU-Pro, MMMLU-lite (AR, BN, DE, EN, ES, FR, HI, ID, IT, JA, KO, MY, PT, SW, YO, ZH), SimpleQA * - **lm-evaluation-harness** - - MMLU, HellaSwag, TruthfulQA, PIQA - - Language model evaluation suite + - Language model benchmarks - [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/lm-evaluation-harness) + - ARC Challenge (also multilingual), GSM8K, HumanEval, HumanEval+, MBPP, Minerva Math, MMLU-Pro, RACE, TruthfulQA, AGIEval, BBH, BBQ, CSQA, Frames, Global MMLU, GPQA-D, HellaSwag (also multilingual), IFEval, MGSM, MMLU, MMLU-ProX (de, es, fr, it, ja), MMLU-Redux, MUSR, OpenbookQA, Piqa, Social IQa, WikiLingua, WinoGrande * - **hle** - - Humanity's Last Exam - - Multi-modal benchmark at the frontier of human knowledge + - Academic knowledge and problem solving - [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/hle) + - HLE * - **ifbench** - - Instruction Following Benchmark - - Precise instruction following evaluation + - Instruction following - [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/ifbench) -* - **mmath** - - Multilingual Mathematical Reasoning - - Math reasoning across multiple languages - - [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/mmath) + - IFBench * - **mtbench** - - MT-Bench - Multi-turn conversation evaluation - [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/mtbench) + - MT-Bench +* - **nemo-skills** + - Language model benchmarks (science, math, agentic) + - [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/nemo_skills) + - AIME 24 & 25, BFCL_v3, GPQA, HLE, LiveCodeBench, MMLU, MMLU-Pro +* - **profbench** + - Evaluation of professional knowledge across Physics PhD, Chemistry PhD, Finance MBA, and Consulting MBA domains + - [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/profbench) + - Report Generation, LLM Judge ``` +:::{note} +BFCL tasks from the nemo-skills container require function calling capabilities. See {ref}`deployment-testing-compatibility` for checking if your endpoint is compatible.
+::: + **Example Usage:** -```bash -# Run academic benchmark suite -nemo-evaluator-launcher run \ - --config-dir packages/nemo-evaluator-launcher/examples \ - --config-name local_llama_3_1_8b_instruct \ - -o 'evaluation.tasks=["mmlu_pro", "gsm8k", "arc_challenge"]' + +Create `config.yml`: + +```yaml +defaults: + - execution: local + - deployment: none + - _self_ + +evaluation: + tasks: + - name: ifeval + - name: gsm8k_cot_instruct + - name: gpqa_diamond ``` -**Python API Example:** -```python -# Evaluate multiple academic benchmarks -academic_tasks = ["mmlu_pro", "gsm8k", "arc_challenge"] -for task in academic_tasks: - eval_config = EvaluationConfig( - type=task, - output_dir=f"./results/{task}/", - params=ConfigParams(temperature=0.01, parallelism=4) - ) - result = evaluate(eval_cfg=eval_config, target_cfg=target_config) +Run evaluation: + +```bash +export NGC_API_KEY=nvapi-... +export HF_TOKEN=hf_... + +nemo-evaluator-launcher run \ + --config-dir . \ + --config-name config.yml \ + -o execution.output_dir=results \ + -o +target.api_endpoint.model_id=meta/llama-3.1-8b-instruct \ + -o +target.api_endpoint.url=https://integrate.api.nvidia.com/v1/chat/completions \ + -o +target.api_endpoint.api_key_name=NGC_API_KEY ``` ### **Code Generation** ```{list-table} :header-rows: 1 -:widths: 25 30 30 15 +:widths: 20 30 30 50 * - Container - - Benchmarks - Description - NGC Catalog + - Benchmarks * - **bigcode-evaluation-harness** - - HumanEval, MBPP, APPS - - Code generation and completion + - Code generation evaluation - [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/bigcode-evaluation-harness) + - MBPP, MBPP-Plus, HumanEval, HumanEval+, Multiple (cpp, cs, d, go, java, jl, js, lua, php, pl, py, r, rb, rkt, rs, scala, sh, swift, ts) * - **livecodebench** - - Live coding contests from LeetCode, AtCoder, CodeForces - - Contamination-free coding evaluation + - Live coding evaluation - [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/livecodebench) + - LiveCodeBench (v1-v6, 0724_0125, 0824_0225) * - **scicode** - - Scientific research code generation - - Scientific computing and research + - Coding for scientific research - [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/scicode) + - SciCode ``` **Example Usage:** + +Create `config.yml`: + +```yaml +defaults: + - execution: local + - deployment: none + - _self_ + +evaluation: + tasks: + - name: humaneval_instruct + - name: mbpp +``` + +Run evaluation: + +```bash -# Run code generation evaluation +export NGC_API_KEY=nvapi-... + nemo-evaluator-launcher run \ - --config-dir packages/nemo-evaluator-launcher/examples \ - --config-name local_llama_3_1_8b_instruct \ - -o 'evaluation.tasks=["humaneval", "mbpp"]' + --config-dir .
\ + --config-name config.yml \ + -o execution.output_dir=results \ + -o +target.api_endpoint.model_id=meta/llama-3.1-8b-instruct \ + -o +target.api_endpoint.url=https://integrate.api.nvidia.com/v1/chat/completions \ + -o +target.api_endpoint.api_key_name=NGC_API_KEY ``` ### **Safety and Security** ```{list-table} :header-rows: 1 -:widths: 25 35 25 15 +:widths: 20 30 30 50 * - Container - - Benchmarks - Description - NGC Catalog + - Benchmarks +* - **garak** + - Safety and vulnerability testing + - [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/garak) + - Garak * - **safety-harness** - - Toxicity, bias, alignment tests - Safety and bias evaluation - [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/safety-harness) -* - **garak** - - Prompt injection, jailbreaking - - Security vulnerability scanning - - [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/garak) + - Aegis v2, BBQ, WildGuard ``` **Example Usage:** + +Create `config.yml`: + +```yaml +defaults: + - execution: local + - deployment: none + - _self_ + +evaluation: + tasks: + - name: aegis_v2 + - name: garak +``` + +Run evaluation: + ```bash -# Run comprehensive safety evaluation +export NGC_API_KEY=nvapi-... +export HF_TOKEN=hf_... + nemo-evaluator-launcher run \ - --config-dir packages/nemo-evaluator-launcher/examples \ - --config-name local_llama_3_1_8b_instruct \ - -o 'evaluation.tasks=["aegis_v2", "garak"]' + --config-dir . \ + --config-name config.yml \ + -o execution.output_dir=results \ + -o +target.api_endpoint.model_id=meta/llama-3.1-8b-instruct \ + -o +target.api_endpoint.url=https://integrate.api.nvidia.com/v1/chat/completions \ + -o +target.api_endpoint.api_key_name=NGC_API_KEY ``` -### **Function Calling and Agentic AI** +### **Function Calling** ```{list-table} :header-rows: 1 -:widths: 25 30 30 15 +:widths: 20 30 30 50 * - Container - - Benchmarks - Description - NGC Catalog + - Benchmarks * - **bfcl** - - Berkeley Function Calling Leaderboard - - Function calling evaluation + - Function calling - [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/bfcl) -* - **agentic_eval** - - Tool usage, planning tasks - - Agentic AI evaluation - - [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/agentic_eval) + - BFCL v2 and v3 * - **tooltalk** - - Tool interaction evaluation - - Tool usage assessment - - [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/tooltalk) + - Tool usage evaluation + - [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/tooltalk) + - ToolTalk +``` + +:::{note} +Some of the tasks in this category require function calling capabilities. See {ref}`deployment-testing-compatibility` for checking if your endpoint is compatible. +::: + +**Example Usage:** + +Create `config.yml`: + +```yaml +defaults: + - execution: local + - deployment: none + - _self_ + +evaluation: + tasks: + - name: bfclv2_ast_prompting + - name: tooltalk ``` +Run evaluation: + +```bash +export NGC_API_KEY=nvapi-... + +nemo-evaluator-launcher run \ + --config-dir . 
\ + --config-name config.yml \ + -o execution.output_dir=results \ + -o +target.api_endpoint.model_id=meta/llama-3.1-8b-instruct \ + -o +target.api_endpoint.url=https://integrate.api.nvidia.com/v1/chat/completions \ + -o +target.api_endpoint.api_key_name=NGC_API_KEY +``` + + ### **Vision-Language Models** ```{list-table} :header-rows: 1 -:widths: 25 35 25 15 +:widths: 20 30 30 50 * - Container - - Benchmarks - Description - NGC Catalog + - Benchmarks * - **vlmevalkit** - - VQA, image captioning, visual reasoning - Vision-language model evaluation - [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/vlmevalkit) + - AI2D, ChartQA, OCRBench, SlideVQA ``` -### **Retrieval and RAG** +:::{note} +The tasks in this category require a VLM chat endpoint. See {ref}`deployment-testing-compatibility` for checking if your endpoint is compatible. +::: -```{list-table} -:header-rows: 1 -:widths: 25 35 25 15 +**Example Usage:** -* - Container - - Benchmarks - - Description - - NGC Catalog -* - **rag_retriever_eval** - - Document retrieval, context relevance - - RAG system evaluation - - [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/rag_retriever_eval) +Create `config.yml`: + +```yaml +defaults: + - execution: local + - deployment: none + - _self_ + +evaluation: + tasks: + - name: ocrbench + - name: chartqa +``` + +Run evaluation: + +```bash +export NGC_API_KEY=nvapi-... + +nemo-evaluator-launcher run \ + --config-dir . \ + --config-name config.yml \ + -o execution.output_dir=results \ + -o +target.api_endpoint.model_id=meta/llama-3.1-8b-instruct \ + -o +target.api_endpoint.url=https://integrate.api.nvidia.com/v1/chat/completions \ + -o +target.api_endpoint.api_key_name=NGC_API_KEY ``` ### **Domain-Specific** ```{list-table} :header-rows: 1 -:widths: 25 35 25 15 +:widths: 20 30 30 50 * - Container - - Benchmarks - Description - NGC Catalog + - Benchmarks * - **helm** - - Medical AI evaluation (MedHELM) - - Healthcare-specific benchmarking + - Holistic evaluation framework - [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/helm) + - MedHelm +``` + +**Example Usage:** + +Create `config.yml`: + +```yaml +defaults: + - execution: local + - deployment: none + - _self_ + +evaluation: + tasks: + - name: pubmed_qa + - name: medcalc_bench +``` + +Run evaluation: + +```bash +export NGC_API_KEY=nvapi-... + +nemo-evaluator-launcher run \ + --config-dir . \ + --config-name config.yml \ + -o execution.output_dir=results \ + -o +target.api_endpoint.model_id=meta/llama-3.1-8b-instruct \ + -o +target.api_endpoint.url=https://integrate.api.nvidia.com/v1/chat/completions \ + -o +target.api_endpoint.api_key_name=NGC_API_KEY ``` ## Container Details @@ -284,7 +416,7 @@ NeMo Evaluator provides multiple integration options to fit your workflow: ```bash # Launcher CLI (recommended for most users) nemo-evaluator-launcher ls tasks -nemo-evaluator-launcher run --config-dir packages/nemo-evaluator-launcher/examples --config-name local_mmlu_evaluation +nemo-evaluator-launcher run --config-dir . --config-name local_mmlu_evaluation.yaml # Container direct execution docker run --rm nvcr.io/nvidia/eval-factory/simple-evals:{{ docker_compose_latest }} nemo-evaluator ls @@ -295,17 +427,6 @@ docker run --rm nvcr.io/nvidia/eval-factory/simple-evals:{{ docker_compose_lates ## Benchmark Selection Best Practices -### For Academic Publications - -**Recommended Core Suite**: -1. **MMLU Pro** or **MMLU** - Broad knowledge assessment -2. 
**GSM8K** - Mathematical reasoning -3. **ARC Challenge** - Scientific reasoning -4. **HellaSwag** - Commonsense reasoning -5. **TruthfulQA** - Factual accuracy - -This suite provides comprehensive coverage across major evaluation dimensions. - ### For Model Development **Iterative Testing**: @@ -333,14 +454,17 @@ params = ConfigParams( ### For Specialized Domains - **Code Models**: Focus on `humaneval`, `mbpp`, `livecodebench` -- **Instruction Models**: Emphasize `ifeval`, `mtbench`, `gpqa_diamond` +- **Instruction Models**: Emphasize `ifbench`, `mtbench` - **Multilingual Models**: Include `arc_multilingual`, `hellaswag_multilingual`, `mgsm` - **Safety-Critical**: Prioritize `safety-harness` and `garak` evaluations +(benchmarks-full-list)= +## Full Benchmarks List + +```{include} ../_resources/tasks-table.md +``` + ## Next Steps -- **Quick Start**: See {ref}`evaluation-overview` for the fastest path to your first evaluation -- **Task-Specific Guides**: Explore {ref}`eval-run` for detailed evaluation workflows -- **Configuration**: Review {ref}`eval-parameters` for optimizing evaluation settings - **Container Details**: Browse {ref}`nemo-evaluator-containers` for complete specifications - **Custom Benchmarks**: Learn {ref}`framework-definition-file` for custom evaluations diff --git a/docs/evaluation/custom-tasks.md b/docs/evaluation/custom-tasks.md index 55604483..41eed787 100644 --- a/docs/evaluation/custom-tasks.md +++ b/docs/evaluation/custom-tasks.md @@ -1,3 +1,7 @@ +--- +orphan: true +--- + (eval-custom-tasks)= (custom-tasks)= diff --git a/docs/evaluation/index.md b/docs/evaluation/index.md index ffefd940..82fab028 100644 --- a/docs/evaluation/index.md +++ b/docs/evaluation/index.md @@ -1,7 +1,3 @@ ---- -orphan: true ---- - (evaluation-overview)= # About Evaluation @@ -14,8 +10,8 @@ Before you run evaluations, ensure you have: 1. **Chosen your approach**: See {ref}`get-started-overview` for installation and setup guidance 2. **Deployed your model**: See {ref}`deployment-overview` for deployment options -3. **OpenAI-compatible endpoint**: Your model must expose a compatible API -4. **API credentials**: Access tokens for your model endpoint +3. **OpenAI-compatible endpoint**: Your model must expose a compatible API (see {ref}`deployment-testing-compatibility`). +4. **API credentials**: Access tokens for your model endpoint and Hugging Face Hub. 
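+
+   For example, you can export the credentials as environment variables before launching an evaluation. This is a minimal sketch: the variable names match the run examples used throughout these docs, and the values shown are placeholders for your own keys.
+
+   ```bash
+   # API key for the model endpoint (NVIDIA Build in the examples below)
+   export NGC_API_KEY=nvapi-...
+   # Hugging Face token for gated benchmark datasets
+   export HF_TOKEN=hf_...
+   ```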
--- @@ -33,9 +29,11 @@ Before you run evaluations, ensure you have: **Step 2: Select Benchmarks** Common academic suites: -- **Language Understanding**: `mmlu_pro`, `arc_challenge`, `hellaswag`, `truthfulqa` -- **Mathematical Reasoning**: `gsm8k`, `math` -- **Instruction Following**: `ifeval`, `gpqa_diamond` +- **General Knowledge**: `mmlu_pro`, `gpqa_diamond` +- **Mathematical Reasoning**: `AIME_2025`, `mgsm` +- **Instruction Following**: `ifbench`, `mtbench` + + Discover all available tasks: ```bash @@ -44,51 +42,38 @@ nemo-evaluator-launcher ls tasks **Step 3: Run Evaluation** -Using Launcher CLI: -```bash -nemo-evaluator-launcher run \ - --config-dir packages/nemo-evaluator-launcher/examples \ - --config-name local_llama_3_1_8b_instruct \ - -o 'evaluation.tasks=["mmlu_pro", "gsm8k", "arc_challenge"]' \ - -o target.api_endpoint.url=https://integrate.api.nvidia.com/v1/chat/completions \ - -o target.api_endpoint.api_key=${YOUR_API_KEY} +Create `config.yml`: + +```yaml +defaults: + - execution: local + - deployment: none + - _self_ + +evaluation: + tasks: + - name: mmlu_pro + - name: ifbench ``` -Using Python API: -```python -from nemo_evaluator.core.evaluate import evaluate -from nemo_evaluator.api.api_dataclasses import ( - EvaluationConfig, EvaluationTarget, ApiEndpoint, ConfigParams, EndpointType -) - -# Configure and run -eval_config = EvaluationConfig( - type="mmlu_pro", - output_dir="./results", - params=ConfigParams( - limit_samples=100, # Start with subset - temperature=0.01, # Near-deterministic - max_new_tokens=512, - parallelism=4 - ) -) - -target_config = EvaluationTarget( - api_endpoint=ApiEndpoint( - url="https://integrate.api.nvidia.com/v1/chat/completions", - model_id="meta/llama-3.1-8b-instruct", - type=EndpointType.CHAT, - api_key="YOUR_API_KEY" - ) -) - -result = evaluate(eval_cfg=eval_config, target_cfg=target_config) +Launch the job: + +```bash +export NGC_API_KEY=nvapi-... + +nemo-evaluator-launcher run \ + --config-dir . \ + --config-name config.yml \ + -o execution.output_dir=results \ + -o +target.api_endpoint.model_id=meta/llama-3.1-8b-instruct \ + -o +target.api_endpoint.url=https://integrate.api.nvidia.com/v1/chat/completions \ + -o +target.api_endpoint.api_key_name=NGC_API_KEY ``` -**Next Steps**: + ::: --- @@ -100,12 +85,6 @@ Select a workflow based on your environment and desired level of control. ::::{grid} 1 2 2 2 :gutter: 1 1 1 2 -:::{grid-item-card} {octicon}`play;1.5em;sd-mr-1` Run Evaluations -:link: run-evals/index -:link-type: doc -Step-by-step guides for different evaluation scenarios using launcher, core API, and container workflows. -::: - :::{grid-item-card} {octicon}`terminal;1.5em;sd-mr-1` Launcher Workflows :link: ../get-started/quickstart/launcher :link-type: doc @@ -133,17 +112,17 @@ Configure your evaluations, create custom tasks, explore benchmarks, and extend ::::{grid} 1 2 2 2 :gutter: 1 1 1 2 -:::{grid-item-card} {octicon}`gear;1.5em;sd-mr-1` Configuration Parameters + -:::{grid-item-card} {octicon}`tools;1.5em;sd-mr-1` Custom Task Configuration + :::{grid-item-card} {octicon}`list-unordered;1.5em;sd-mr-1` Benchmark Catalog :link: eval-benchmarks @@ -184,12 +163,6 @@ Export evaluation results to MLflow, Weights & Biases, Google Sheets, and other Configure request/response processing, logging, caching, and custom interceptors. 
::: -:::{grid-item-card} {octicon}`alert;1.5em;sd-mr-1` Troubleshooting -:link: ../troubleshooting/index -:link-type: doc -Resolve common evaluation issues, debug configuration problems, and optimize evaluation performance. -::: - :::: ## Core Evaluation Concepts diff --git a/docs/evaluation/parameters.md b/docs/evaluation/parameters.md index 7248e1a9..84c68950 100644 --- a/docs/evaluation/parameters.md +++ b/docs/evaluation/parameters.md @@ -1,3 +1,6 @@ +--- +orphan: true +--- (eval-parameters)= # Evaluation Configuration Parameters diff --git a/docs/evaluation/run-evals/index.md b/docs/evaluation/run-evals/index.md index 668aa4e6..7459cb97 100644 --- a/docs/evaluation/run-evals/index.md +++ b/docs/evaluation/run-evals/index.md @@ -1,3 +1,6 @@ +--- +orphan: true +--- (eval-run)= # Run Evaluations diff --git a/docs/get-started/quickstart/core.md b/docs/get-started/quickstart/core.md index 2474c5f2..c75f6e1b 100644 --- a/docs/get-started/quickstart/core.md +++ b/docs/get-started/quickstart/core.md @@ -9,7 +9,7 @@ The NeMo Evaluator Core provides direct Python API access for custom configurati - Python environment - OpenAI-compatible endpoint (hosted or self-deployed) and an API key (if the endpoint is gated) -- Verify endpoint compatibility using our [Testing Endpoint Compatibility](../../deployment/bring-your-own-endpoint/testing-endpoint-oai-compatibility.md) guide +- Verify endpoint compatibility using our {ref}`deployment-testing-compatibility` guide ## Quick Start diff --git a/docs/get-started/quickstart/index.md b/docs/get-started/quickstart/index.md index 067dd393..a0ade84e 100644 --- a/docs/get-started/quickstart/index.md +++ b/docs/get-started/quickstart/index.md @@ -68,7 +68,7 @@ NeMo Evaluator works with any OpenAI-compatible endpoint. You have several optio ### **Self-Hosted Options** -If you prefer to host your own models, verify OpenAI compatibility using our [Testing Endpoint Compatibility](../../deployment/bring-your-own-endpoint/testing-endpoint-oai-compatibility.md) guide. +If you prefer to host your own models, verify OpenAI compatibility using our {ref}`deployment-testing-compatibility` guide. If you are deploying the model locally with Docker, you can use a dedicated docker network. This will provide a secure connetion between deployment and evaluation docker containers. diff --git a/docs/index.md b/docs/index.md index 142a70b9..1afba1bd 100644 --- a/docs/index.md +++ b/docs/index.md @@ -354,15 +354,19 @@ Quickstart About Tutorials ::: --> - +:::{toctree} :caption: Evaluation :hidden: About Model Evaluation -Run Evals -Custom Task Configuration Benchmark Catalog -::: --> +:::