diff --git a/README.md b/README.md index 0b146156..2c673903 100644 --- a/README.md +++ b/README.md @@ -51,7 +51,6 @@ NeMo Evaluator Launcher provides pre-built evaluation containers for different e | **mmath** | Multilingual math reasoning | [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/mmath) | `25.09.1` | EN, ZH, AR, ES, FR, JA, KO, PT, TH, VI | | **mtbench** | Multi-turn conversation evaluation | [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/mtbench) | `25.09.1` | MT-Bench | | **nemo-skills** | Language model benchmarks (science, math, agentic) | [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/nemo_skills) | `25.09.1` | AIME 24 & 25, BFCL_v3, GPQA, HLE, LiveCodeBench, MMLU, MMLU-Pro | -| **mtbench** | Multi-turn conversation evaluation | [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/mtbench) | `25.09.1` | MT-Bench | | **profbench** | Professional domains in Business and Scientific Research | [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/profbench) | `25.09.1` | ProfBench | | **rag_retriever_eval** | RAG system evaluation | [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/rag_retriever_eval) | `25.09.1` | RAG, Retriever | | **safety-harness** | Safety and bias evaluation | [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/safety-harness) | `25.09.1` | Aegis v2, BBQ, WildGuard | diff --git a/docs/_resources/tasks-table.md b/docs/_resources/tasks-table.md new file mode 100644 index 00000000..73fc2254 --- /dev/null +++ b/docs/_resources/tasks-table.md @@ -0,0 +1,116 @@ + +```{list-table} +:header-rows: 1 +:widths: 20 25 15 15 25 + +* - Container + - Description + - NGC Catalog + - Latest Tag + - Key Benchmarks +* - **agentic_eval** + - Agentic AI evaluation framework + - [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/agentic_eval) + - {{ docker_compose_latest }} + - agentic_eval_answer_accuracy, agentic_eval_goal_accuracy_with_reference, agentic_eval_goal_accuracy_without_reference, agentic_eval_topic_adherence, agentic_eval_tool_call_accuracy +* - **bfcl** + - Function calling evaluation + - [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/bfcl) + - {{ docker_compose_latest }} + - bfclv2, bfclv2_ast, bfclv2_ast_prompting, bfclv3, bfclv3_ast, bfclv3_ast_prompting +* - **bigcode-evaluation-harness** + - Code generation evaluation + - [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/bigcode-evaluation-harness) + - {{ docker_compose_latest }} + - humaneval, humaneval_instruct, humanevalplus, mbpp, mbppplus, mbppplus_nemo, multiple-cpp, multiple-cs, multiple-d, multiple-go, multiple-java, multiple-jl, multiple-js, multiple-lua, multiple-php, multiple-pl, multiple-py, multiple-r, multiple-rb, multiple-rkt, multiple-rs, multiple-scala, multiple-sh, multiple-swift, multiple-ts +* - **compute-eval** + - CUDA code evaluation + - [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/compute-eval) + - {{ docker_compose_latest }} + - cccl_problems, combined_problems, cuda_problems +* - **garak** + - Security and robustness testing + - [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/garak) + - {{ docker_compose_latest }} + - garak +* - **genai-perf** + - GenAI performance benchmarking + - 
[NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/genai-perf) + - {{ docker_compose_latest }} + - genai_perf_generation, genai_perf_summarization +* - **helm** + - Holistic evaluation framework + - [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/helm) + - {{ docker_compose_latest }} + - ci_bench, ehr_sql, head_qa, med_dialog_healthcaremagic, med_dialog_icliniq, medbullets, medcalc_bench, medec, medhallu, medi_qa, medication_qa, mtsamples_procedures, mtsamples_replicate, pubmed_qa, race_based_med +* - **hle** + - Academic knowledge and problem solving + - [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/hle) + - {{ docker_compose_latest }} + - hle +* - **ifbench** + - Instruction following evaluation + - [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/ifbench) + - {{ docker_compose_latest }} + - ifbench +* - **livecodebench** + - Live coding evaluation + - [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/livecodebench) + - {{ docker_compose_latest }} + - AA_codegeneration, codeexecution_v2, codeexecution_v2_cot, codegeneration_notfast, codegeneration_release_latest, codegeneration_release_v1, codegeneration_release_v2, codegeneration_release_v3, codegeneration_release_v4, codegeneration_release_v5, codegeneration_release_v6, livecodebench_0724_0125, livecodebench_0824_0225 +* - **lm-evaluation-harness** + - Language model benchmarks + - [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/lm-evaluation-harness) + - {{ docker_compose_latest }} + - adlr_arc_challenge_llama, adlr_gsm8k_fewshot_cot, adlr_humaneval_greedy, adlr_humanevalplus_greedy, adlr_mbpp_sanitized_3shot_greedy, adlr_mbppplus_greedy_sanitized, adlr_minerva_math_nemo, adlr_mmlu_pro_5_shot_base, adlr_race, adlr_truthfulqa_mc2, agieval, arc_challenge, arc_challenge_chat, arc_multilingual, bbh, bbh_instruct, bbq, commonsense_qa, frames_naive, frames_naive_with_links, frames_oracle, global_mmlu, global_mmlu_ar, global_mmlu_bn, global_mmlu_de, global_mmlu_en, global_mmlu_es, global_mmlu_fr, global_mmlu_full, global_mmlu_full_am, global_mmlu_full_ar, global_mmlu_full_bn, global_mmlu_full_cs, global_mmlu_full_de, global_mmlu_full_el, global_mmlu_full_en, global_mmlu_full_es, global_mmlu_full_fa, global_mmlu_full_fil, global_mmlu_full_fr, global_mmlu_full_ha, global_mmlu_full_he, global_mmlu_full_hi, global_mmlu_full_id, global_mmlu_full_ig, global_mmlu_full_it, global_mmlu_full_ja, global_mmlu_full_ko, global_mmlu_full_ky, global_mmlu_full_lt, global_mmlu_full_mg, global_mmlu_full_ms, global_mmlu_full_ne, global_mmlu_full_nl, global_mmlu_full_ny, global_mmlu_full_pl, global_mmlu_full_pt, global_mmlu_full_ro, global_mmlu_full_ru, global_mmlu_full_si, global_mmlu_full_sn, global_mmlu_full_so, global_mmlu_full_sr, global_mmlu_full_sv, global_mmlu_full_sw, global_mmlu_full_te, global_mmlu_full_tr, global_mmlu_full_uk, global_mmlu_full_vi, global_mmlu_full_yo, global_mmlu_full_zh, global_mmlu_hi, global_mmlu_id, global_mmlu_it, global_mmlu_ja, global_mmlu_ko, global_mmlu_pt, global_mmlu_sw, global_mmlu_yo, global_mmlu_zh, gpqa, gpqa_diamond_cot, gpqa_diamond_cot_5_shot, gsm8k, gsm8k_cot_instruct, gsm8k_cot_llama, gsm8k_cot_zeroshot, gsm8k_cot_zeroshot_llama, hellaswag, hellaswag_multilingual, humaneval_instruct, ifeval, m_mmlu_id_str, mbpp_plus, mgsm, mgsm_cot, mmlu, mmlu_cot_0_shot_chat, mmlu_instruct, mmlu_logits, mmlu_pro, mmlu_pro_instruct, mmlu_prox, 
mmlu_prox_de, mmlu_prox_es, mmlu_prox_fr, mmlu_prox_it, mmlu_prox_ja, mmlu_redux, mmlu_redux_instruct, musr, openbookqa, piqa, social_iqa, truthfulqa, wikilingua, winogrande +* - **mmath** + - Multilingual math reasoning + - [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/mmath) + - {{ docker_compose_latest }} + - mmath_ar, mmath_en, mmath_es, mmath_fr, mmath_ja, mmath_ko, mmath_pt, mmath_th, mmath_vi, mmath_zh +* - **mtbench** + - Multi-turn conversation evaluation + - [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/mtbench) + - {{ docker_compose_latest }} + - mtbench, mtbench-cor1 +* - **nemo-skills** + - Language model benchmarks (science, math, agentic) + - [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/nemo_skills) + - {{ docker_compose_latest }} + - ns_aime2024, ns_aime2025, ns_aime2025_ef, ns_bfcl_v3, ns_gpqa, ns_gpqa_ef, ns_hle, ns_livecodebench, ns_mmlu, ns_mmlu_pro +* - **profbench** + - Professional domains in Business and Scientific Research + - [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/profbench) + - {{ docker_compose_latest }} + - report_generation, llm_judge +* - **rag_retriever_eval** + - RAG system evaluation + - [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/rag_retriever_eval) + - {{ docker_compose_latest }} + - RAG, Retriever +* - **safety-harness** + - Safety and bias evaluation + - [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/safety-harness) + - {{ docker_compose_latest }} + - aegis_v2, aegis_v2_ar, aegis_v2_de, aegis_v2_es, aegis_v2_fr, aegis_v2_hi, aegis_v2_ja, aegis_v2_reasoning, aegis_v2_th, aegis_v2_zh-CN, bbq_full, bbq_small, wildguard +* - **scicode** + - Coding for scientific research + - [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/scicode) + - {{ docker_compose_latest }} + - aa_scicode, scicode, scicode_background +* - **simple-evals** + - Basic evaluation tasks + - [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/simple-evals) + - {{ docker_compose_latest }} + - AA_AIME_2024, AA_math_test_500, AIME_2024, AIME_2025, aime_2024_nemo, aime_2025_nemo, gpqa_diamond, gpqa_diamond_aa_v2, gpqa_diamond_aa_v2_llama_4, gpqa_diamond_nemo, gpqa_extended, gpqa_main, humaneval, humanevalplus, math_test_500, math_test_500_nemo, mgsm, mmlu, mmlu_am, mmlu_ar, mmlu_ar-lite, mmlu_bn, mmlu_bn-lite, mmlu_cs, mmlu_de, mmlu_de-lite, mmlu_el, mmlu_en, mmlu_en-lite, mmlu_es, mmlu_es-lite, mmlu_fa, mmlu_fil, mmlu_fr, mmlu_fr-lite, mmlu_ha, mmlu_he, mmlu_hi, mmlu_hi-lite, mmlu_id, mmlu_id-lite, mmlu_ig, mmlu_it, mmlu_it-lite, mmlu_ja, mmlu_ja-lite, mmlu_ko, mmlu_ko-lite, mmlu_ky, mmlu_llama_4, mmlu_lt, mmlu_mg, mmlu_ms, mmlu_my-lite, mmlu_ne, mmlu_nl, mmlu_ny, mmlu_pl, mmlu_pro, mmlu_pro_llama_4, mmlu_pt, mmlu_pt-lite, mmlu_ro, mmlu_ru, mmlu_si, mmlu_sn, mmlu_so, mmlu_sr, mmlu_sv, mmlu_sw, mmlu_sw-lite, mmlu_te, mmlu_tr, mmlu_uk, mmlu_vi, mmlu_yo, mmlu_yo-lite, mmlu_zh-lite, simpleqa +* - **tooltalk** + - Tool usage evaluation + - [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/tooltalk) + - {{ docker_compose_latest }} + - tooltalk +* - **vlmevalkit** + - Vision-language model evaluation + - [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/vlmevalkit) + - {{ docker_compose_latest }} + - ai2d_judge, chartqa, ocrbench, slidevqa +``` diff --git a/docs/about/key-features.md 
b/docs/about/key-features.md index 519f9713..5079bf22 100644 --- a/docs/about/key-features.md +++ b/docs/about/key-features.md @@ -57,115 +57,7 @@ nemo-evaluator-launcher export --dest gsheets ### Container-First Architecture Pre-built NGC containers guarantee reproducible results across environments: -```{list-table} -:header-rows: 1 -:widths: 20 25 15 15 25 - -* - Container - - Description - - NGC Catalog - - Latest Tag - - Key Benchmarks -* - **agentic_eval** - - Agentic AI evaluation framework - - [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/agentic_eval) - - {{ docker_compose_latest }} - - agentic_eval_answer_accuracy, agentic_eval_goal_accuracy_with_reference, agentic_eval_goal_accuracy_without_reference, agentic_eval_topic_adherence, agentic_eval_tool_call_accuracy -* - **bfcl** - - Function calling evaluation - - [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/bfcl) - - {{ docker_compose_latest }} - - bfclv2, bfclv2_ast, bfclv2_ast_prompting, bfclv3, bfclv3_ast, bfclv3_ast_prompting -* - **bigcode-evaluation-harness** - - Code generation evaluation - - [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/bigcode-evaluation-harness) - - {{ docker_compose_latest }} - - humaneval, humaneval_instruct, humanevalplus, mbpp, mbppplus, mbppplus_nemo, multiple-cpp, multiple-cs, multiple-d, multiple-go, multiple-java, multiple-jl, multiple-js, multiple-lua, multiple-php, multiple-pl, multiple-py, multiple-r, multiple-rb, multiple-rkt, multiple-rs, multiple-scala, multiple-sh, multiple-swift, multiple-ts -* - **compute-eval** - - CUDA code evaluation - - [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/compute-eval) - - {{ docker_compose_latest }} - - cccl_problems, combined_problems, cuda_problems -* - **garak** - - Security and robustness testing - - [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/garak) - - {{ docker_compose_latest }} - - garak -* - **genai-perf** - - GenAI performance benchmarking - - [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/genai-perf) - - {{ docker_compose_latest }} - - genai_perf_generation, genai_perf_summarization -* - **helm** - - Holistic evaluation framework - - [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/helm) - - {{ docker_compose_latest }} - - ci_bench, ehr_sql, head_qa, med_dialog_healthcaremagic, med_dialog_icliniq, medbullets, medcalc_bench, medec, medhallu, medi_qa, medication_qa, mtsamples_procedures, mtsamples_replicate, pubmed_qa, race_based_med -* - **hle** - - Academic knowledge and problem solving - - [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/hle) - - {{ docker_compose_latest }} - - hle -* - **ifbench** - - Instruction following evaluation - - [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/ifbench) - - {{ docker_compose_latest }} - - ifbench -* - **livecodebench** - - Live coding evaluation - - [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/livecodebench) - - {{ docker_compose_latest }} - - AA_codegeneration, codeexecution_v2, codeexecution_v2_cot, codegeneration_notfast, codegeneration_release_latest, codegeneration_release_v1, codegeneration_release_v2, codegeneration_release_v3, codegeneration_release_v4, codegeneration_release_v5, codegeneration_release_v6, livecodebench_0724_0125, livecodebench_0824_0225 -* - 
**lm-evaluation-harness** - - Language model benchmarks - - [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/lm-evaluation-harness) - - {{ docker_compose_latest }} - - adlr_arc_challenge_llama, adlr_gsm8k_fewshot_cot, adlr_humaneval_greedy, adlr_humanevalplus_greedy, adlr_mbpp_sanitized_3shot_greedy, adlr_mbppplus_greedy_sanitized, adlr_minerva_math_nemo, adlr_mmlu_pro_5_shot_base, adlr_race, adlr_truthfulqa_mc2, agieval, arc_challenge, arc_challenge_chat, arc_multilingual, bbh, bbh_instruct, bbq, commonsense_qa, frames_naive, frames_naive_with_links, frames_oracle, global_mmlu, global_mmlu_ar, global_mmlu_bn, global_mmlu_de, global_mmlu_en, global_mmlu_es, global_mmlu_fr, global_mmlu_full, global_mmlu_full_am, global_mmlu_full_ar, global_mmlu_full_bn, global_mmlu_full_cs, global_mmlu_full_de, global_mmlu_full_el, global_mmlu_full_en, global_mmlu_full_es, global_mmlu_full_fa, global_mmlu_full_fil, global_mmlu_full_fr, global_mmlu_full_ha, global_mmlu_full_he, global_mmlu_full_hi, global_mmlu_full_id, global_mmlu_full_ig, global_mmlu_full_it, global_mmlu_full_ja, global_mmlu_full_ko, global_mmlu_full_ky, global_mmlu_full_lt, global_mmlu_full_mg, global_mmlu_full_ms, global_mmlu_full_ne, global_mmlu_full_nl, global_mmlu_full_ny, global_mmlu_full_pl, global_mmlu_full_pt, global_mmlu_full_ro, global_mmlu_full_ru, global_mmlu_full_si, global_mmlu_full_sn, global_mmlu_full_so, global_mmlu_full_sr, global_mmlu_full_sv, global_mmlu_full_sw, global_mmlu_full_te, global_mmlu_full_tr, global_mmlu_full_uk, global_mmlu_full_vi, global_mmlu_full_yo, global_mmlu_full_zh, global_mmlu_hi, global_mmlu_id, global_mmlu_it, global_mmlu_ja, global_mmlu_ko, global_mmlu_pt, global_mmlu_sw, global_mmlu_yo, global_mmlu_zh, gpqa, gpqa_diamond_cot, gpqa_diamond_cot_5_shot, gsm8k, gsm8k_cot_instruct, gsm8k_cot_llama, gsm8k_cot_zeroshot, gsm8k_cot_zeroshot_llama, hellaswag, hellaswag_multilingual, humaneval_instruct, ifeval, m_mmlu_id_str, mbpp_plus, mgsm, mgsm_cot, mmlu, mmlu_cot_0_shot_chat, mmlu_instruct, mmlu_logits, mmlu_pro, mmlu_pro_instruct, mmlu_prox, mmlu_prox_de, mmlu_prox_es, mmlu_prox_fr, mmlu_prox_it, mmlu_prox_ja, mmlu_redux, mmlu_redux_instruct, musr, openbookqa, piqa, social_iqa, truthfulqa, wikilingua, winogrande -* - **mmath** - - Multilingual math reasoning - - [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/mmath) - - {{ docker_compose_latest }} - - mmath_ar, mmath_en, mmath_es, mmath_fr, mmath_ja, mmath_ko, mmath_pt, mmath_th, mmath_vi, mmath_zh -* - **mtbench** - - Multi-turn conversation evaluation - - [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/mtbench) - - {{ docker_compose_latest }} - - mtbench, mtbench-cor1 -* - **nemo-skills** - - Language model benchmarks (science, math, agentic) - - [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/nemo_skills) - - {{ docker_compose_latest }} - - ns_aime2024, ns_aime2025, ns_aime2025_ef, ns_bfcl_v3, ns_gpqa, ns_gpqa_ef, ns_hle, ns_livecodebench, ns_mmlu, ns_mmlu_pro -* - **rag_retriever_eval** - - RAG system evaluation - - [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/rag_retriever_eval) - - {{ docker_compose_latest }} - - RAG, Retriever -* - **safety-harness** - - Safety and bias evaluation - - [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/safety-harness) - - {{ docker_compose_latest }} - - aegis_v2, aegis_v2_ar, aegis_v2_de, aegis_v2_es, aegis_v2_fr, aegis_v2_hi, 
aegis_v2_ja, aegis_v2_reasoning, aegis_v2_th, aegis_v2_zh-CN, bbq_full, bbq_small, wildguard -* - **scicode** - - Coding for scientific research - - [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/scicode) - - {{ docker_compose_latest }} - - aa_scicode, scicode, scicode_background -* - **simple-evals** - - Basic evaluation tasks - - [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/simple-evals) - - {{ docker_compose_latest }} - - AA_AIME_2024, AA_math_test_500, AIME_2024, AIME_2025, aime_2024_nemo, aime_2025_nemo, gpqa_diamond, gpqa_diamond_aa_v2, gpqa_diamond_aa_v2_llama_4, gpqa_diamond_nemo, gpqa_extended, gpqa_main, humaneval, humanevalplus, math_test_500, math_test_500_nemo, mgsm, mmlu, mmlu_am, mmlu_ar, mmlu_ar-lite, mmlu_bn, mmlu_bn-lite, mmlu_cs, mmlu_de, mmlu_de-lite, mmlu_el, mmlu_en, mmlu_en-lite, mmlu_es, mmlu_es-lite, mmlu_fa, mmlu_fil, mmlu_fr, mmlu_fr-lite, mmlu_ha, mmlu_he, mmlu_hi, mmlu_hi-lite, mmlu_id, mmlu_id-lite, mmlu_ig, mmlu_it, mmlu_it-lite, mmlu_ja, mmlu_ja-lite, mmlu_ko, mmlu_ko-lite, mmlu_ky, mmlu_llama_4, mmlu_lt, mmlu_mg, mmlu_ms, mmlu_my-lite, mmlu_ne, mmlu_nl, mmlu_ny, mmlu_pl, mmlu_pro, mmlu_pro_llama_4, mmlu_pt, mmlu_pt-lite, mmlu_ro, mmlu_ru, mmlu_si, mmlu_sn, mmlu_so, mmlu_sr, mmlu_sv, mmlu_sw, mmlu_sw-lite, mmlu_te, mmlu_tr, mmlu_uk, mmlu_vi, mmlu_yo, mmlu_yo-lite, mmlu_zh-lite, simpleqa -* - **tooltalk** - - Tool usage evaluation - - [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/tooltalk) - - {{ docker_compose_latest }} - - tooltalk -* - **vlmevalkit** - - Vision-language model evaluation - - [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/vlmevalkit) - - {{ docker_compose_latest }} - - ai2d_judge, chartqa, ocrbench, slidevqa +```{include} ../_resources/tasks-table.md ``` ```bash @@ -302,7 +194,7 @@ NeMo Evaluator supports OpenAI-compatible API endpoints: - **Hosted Models**: NVIDIA Build, OpenAI, Anthropic, Cohere - **Self-Hosted**: vLLM, TRT-LLM, NeMo Framework -- **Custom Endpoints**: Any service implementing OpenAI API spec (test compatibility with our [Testing Endpoint Compatibility](../deployment/bring-your-own-endpoint/testing-endpoint-oai-compatibility.md) guide) +- **Custom Endpoints**: Any service implementing OpenAI API spec (test compatibility with our {ref}`deployment-testing-compatibility` guide) The platform supports the following endpoint types: diff --git a/docs/conf.py b/docs/conf.py index fe8ffaf4..2b9a842b 100644 --- a/docs/conf.py +++ b/docs/conf.py @@ -136,7 +136,7 @@ "support_email": "update-me", "min_python_version": "3.8", "recommended_cuda": "12.0+", - "docker_compose_latest": "25.09", + "docker_compose_latest": "25.09.1", } # Enable figure numbering diff --git a/docs/deployment/bring-your-own-endpoint/index.md b/docs/deployment/bring-your-own-endpoint/index.md index dc7fb7cf..055340e8 100644 --- a/docs/deployment/bring-your-own-endpoint/index.md +++ b/docs/deployment/bring-your-own-endpoint/index.md @@ -86,7 +86,7 @@ Your endpoint must provide OpenAI-compatible APIs: - **Health Check**: `/v1/health` (GET) - For monitoring (recommended) ### Request/Response Format -Must follow OpenAI API specifications for compatibility with evaluation frameworks. See the [Testing Endpoint Compatibility](testing-endpoint-oai-compatibility.md) guide to verify your endpoint's OpenAI compatibility. +Must follow OpenAI API specifications for compatibility with evaluation frameworks. 
See the {ref}`deployment-testing-compatibility` guide to verify your endpoint's OpenAI compatibility. ## Configuration Management diff --git a/docs/deployment/bring-your-own-endpoint/manual-deployment.md b/docs/deployment/bring-your-own-endpoint/manual-deployment.md index 5c1b1c57..111ef435 100644 --- a/docs/deployment/bring-your-own-endpoint/manual-deployment.md +++ b/docs/deployment/bring-your-own-endpoint/manual-deployment.md @@ -27,7 +27,7 @@ This guide focuses on NeMo Evaluator configuration. For specific serving framewo ## Using Manual Deployments with NeMo Evaluator -Before connecting to your manual deployment, verify it's properly configured using our [Testing Endpoint Compatibility](testing-endpoint-oai-compatibility.md) guide. +Before connecting to your manual deployment, verify it's properly configured using our {ref}`deployment-testing-compatibility` guide. ### With Launcher diff --git a/docs/deployment/bring-your-own-endpoint/testing-endpoint-oai-compatibility.md b/docs/deployment/bring-your-own-endpoint/testing-endpoint-oai-compatibility.md index 799cbec9..f4cd305f 100644 --- a/docs/deployment/bring-your-own-endpoint/testing-endpoint-oai-compatibility.md +++ b/docs/deployment/bring-your-own-endpoint/testing-endpoint-oai-compatibility.md @@ -1,3 +1,4 @@ +(deployment-testing-compatibility)= # Testing Endpoint Compatibility This guide helps you test your hosted endpoint to verify OpenAI-compatible API compatibility using `curl` requests for different task types. Models deployed using `nemo-evaluator-launcher` should be compatible with these tests. diff --git a/docs/deployment/index.md b/docs/deployment/index.md index 3a95cdc5..5a30ea45 100644 --- a/docs/deployment/index.md +++ b/docs/deployment/index.md @@ -133,7 +133,7 @@ Choose from these approaches when managing your own deployment: +- **Custom serving**: Any OpenAI-compatible endpoint (verify compatibility with our {ref}`deployment-testing-compatibility` guide) --> ### Hosted Services - **NVIDIA Build**: Ready-to-use hosted models with OpenAI-compatible APIs diff --git a/docs/evaluation/_snippets/commands/list_tasks.sh b/docs/evaluation/_snippets/commands/list_tasks.sh index 31b20964..1930dd7c 100755 --- a/docs/evaluation/_snippets/commands/list_tasks.sh +++ b/docs/evaluation/_snippets/commands/list_tasks.sh @@ -1,14 +1,14 @@ #!/bin/bash -# Task discovery commands for NeMo Evaluator +# Task discovery commands for NeMo Evaluator Launcher # [snippet-start] # List all available benchmarks -nemo-evaluator-launcher ls tasks +nemo-evaluator-launcher ls # Output as JSON for programmatic filtering -nemo-evaluator-launcher ls tasks --json +nemo-evaluator-launcher ls --json -# Filter for specific task types (example: academic benchmarks) -nemo-evaluator-launcher ls tasks | grep -E "(mmlu|gsm8k|arc)" +# Filter for specific task types (example: mmlu, gsm8k, and arc_challenge) +nemo-evaluator-launcher ls | grep -E "(mmlu|gsm8k|arc)" # [snippet-end] diff --git a/docs/evaluation/_snippets/commands/list_tasks_core.sh b/docs/evaluation/_snippets/commands/list_tasks_core.sh new file mode 100644 index 00000000..902bb2b6 --- /dev/null +++ b/docs/evaluation/_snippets/commands/list_tasks_core.sh @@ -0,0 +1,8 @@ +#!/bin/bash +# Task discovery commands for NeMo Evaluator +# FIXME(martas): Hard-code the container version + +# [snippet-start] +# List benchmarks available in the container +docker run --rm nvcr.io/nvidia/eval-factory/simple-evals:25.09.1 nemo-evaluator ls +# [snippet-end] diff --git a/docs/evaluation/benchmarks.md 
b/docs/evaluation/benchmarks.md index c37ac8e7..489a94da 100644 --- a/docs/evaluation/benchmarks.md +++ b/docs/evaluation/benchmarks.md @@ -2,13 +2,9 @@ # Benchmark Catalog -Comprehensive catalog of 100+ benchmarks across 18 evaluation harnesses, all available through NGC containers and the NeMo Evaluator platform. +Comprehensive catalog of hundreds of benchmarks across popular evaluation harnesses, all available through NGC containers and the NeMo Evaluator platform. -## Overview - -NeMo Evaluator provides access to benchmarks across multiple domains through pre-built NGC containers and the unified launcher CLI. Each container specializes in different evaluation domains while maintaining consistent interfaces and reproducible results. - ## Available via Launcher ```{literalinclude} _snippets/commands/list_tasks.sh @@ -17,36 +13,32 @@ NeMo Evaluator provides access to benchmarks across multiple domains through pre :end-before: "# [snippet-end]" ``` +## Available via Direct Container Access + +```{literalinclude} _snippets/commands/list_tasks_core.sh +:language: bash +:start-after: "# [snippet-start]" +:end-before: "# [snippet-end]" +``` + ## Choosing Benchmarks for Academic Research :::{admonition} Benchmark Selection Guide :class: tip -**For Language Understanding & General Knowledge**: -Recommended suite for comprehensive model evaluation: +**For General Knowledge**: - `mmlu_pro` - Expert-level knowledge across 14 domains -- `arc_challenge` - Complex reasoning and science questions -- `hellaswag` - Commonsense reasoning about situations -- `truthfulqa` - Factual accuracy vs. plausibility - -```bash -nemo-evaluator-launcher run \ - --config-dir packages/nemo-evaluator-launcher/examples \ - --config-name local_academic_suite \ - -o 'evaluation.tasks=["mmlu_pro", "arc_challenge", "hellaswag", "truthfulqa"]' -``` +- `gpqa_diamond` - Graduate-level science questions **For Mathematical & Quantitative Reasoning**: -- `gsm8k` - Grade school math word problems -- `math` - Competition-level mathematics +- `AIME_2025` - American Invitational Mathematics Examination (AIME) 2025 questions - `mgsm` - Multilingual math reasoning **For Instruction Following & Alignment**: -- `ifeval` - Precise instruction following -- `gpqa_diamond` - Graduate-level science questions +- `ifbench` - Precise instruction following - `mtbench` - Multi-turn conversation quality -**See benchmark details below** for complete task descriptions and requirements. +See benchmark categories below and {ref}`benchmarks-full-list` for more details. 
::: ## Benchmark Categories ### **Academic & Reasoning** ```{list-table} :header-rows: 1 -:widths: 20 30 30 20 +:widths: 20 30 30 50 * - Container - - Benchmarks - Description - NGC Catalog + - Benchmarks * - **simple-evals** - - MMLU Pro, GSM8K, ARC Challenge - - Core academic benchmarks + - Common evaluation tasks - [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/simple-evals) + - GPQA-D, MATH-500, AIME 24 & 25, HumanEval, MGSM, MMMLU, MMLU-Pro, MMMLU-lite (AR, BN, DE, EN, ES, FR, HI, ID, IT, JA, KO, MY, PT, SW, YO, ZH), SimpleQA * - **lm-evaluation-harness** - - MMLU, HellaSwag, TruthfulQA, PIQA - - Language model evaluation suite + - Language model benchmarks - [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/lm-evaluation-harness) + - ARC Challenge (also multilingual), GSM8K, HumanEval, HumanEval+, MBPP, Minerva Math, MMLU-Pro, RACE, TruthfulQA, AGIEval, BBH, BBQ, CSQA, Frames, Global MMLU, GPQA-D, HellaSwag (also multilingual), IFEval, MGSM, MMLU, MMLU-ProX (de, es, fr, it, ja), MMLU-Redux, MUSR, OpenbookQA, Piqa, Social IQa, WikiLingua, WinoGrande * - **hle** - - Humanity's Last Exam - - Multi-modal benchmark at the frontier of human knowledge + - Academic knowledge and problem solving - [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/hle) + - HLE * - **ifbench** - - Instruction Following Benchmark - - Precise instruction following evaluation + - Instruction following - [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/ifbench) -* - **mmath** - - Multilingual Mathematical Reasoning - - Math reasoning across multiple languages - - [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/mmath) + - IFBench * - **mtbench** - - MT-Bench - Multi-turn conversation evaluation - [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/mtbench) + - MT-Bench +* - **nemo-skills** + - Language model benchmarks (science, math, agentic) + - [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/nemo_skills) + - AIME 24 & 25, BFCL_v3, GPQA, HLE, LiveCodeBench, MMLU, MMLU-Pro +* - **profbench** + - Evaluation of professional knowledge across Physics PhD, Chemistry PhD, Finance MBA, and Consulting MBA domains + - [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/profbench) + - Report Generation, LLM Judge ``` +:::{note} +BFCL tasks from the nemo-skills container require function calling capabilities. See {ref}`deployment-testing-compatibility` for checking if your endpoint is compatible.
+::: + **Example Usage:** -```bash -# Run academic benchmark suite -nemo-evaluator-launcher run \ - --config-dir packages/nemo-evaluator-launcher/examples \ - --config-name local_llama_3_1_8b_instruct \ - -o 'evaluation.tasks=["mmlu_pro", "gsm8k", "arc_challenge"]' + +Create `config.yml`: + +```yaml +defaults: + - execution: local + - deployment: none + - _self_ + +evaluation: + tasks: + - name: ifeval + - name: gsm8k_cot_instruct + - name: gpqa_diamond ``` -**Python API Example:** -```python -# Evaluate multiple academic benchmarks -academic_tasks = ["mmlu_pro", "gsm8k", "arc_challenge"] -for task in academic_tasks: - eval_config = EvaluationConfig( - type=task, - output_dir=f"./results/{task}/", - params=ConfigParams(temperature=0.01, parallelism=4) - ) - result = evaluate(eval_cfg=eval_config, target_cfg=target_config) +Run evaluation: + +```bash +export NGC_API_KEY=nvapi-... +export HF_TOKEN=hf_... + +nemo-evaluator-launcher run \ + --config-dir . \ + --config-name config.yml \ + -o execution.output_dir=results \ + -o +target.api_endpoint.model_id=meta/llama-3.1-8b-instruct \ + -o +target.api_endpoint.url=https://integrate.api.nvidia.com/v1/chat/completions \ + -o +target.api_endpoint.api_key_name=NGC_API_KEY ``` ### **Code Generation** ```{list-table} :header-rows: 1 -:widths: 25 30 30 15 +:widths: 20 30 30 50 * - Container - - Benchmarks - Description - NGC Catalog + - Benchmarks * - **bigcode-evaluation-harness** - - HumanEval, MBPP, APPS - - Code generation and completion + - Code generation evaluation - [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/bigcode-evaluation-harness) + - MBPP, MBPP-Plus, HumanEval, HumanEval+, Multiple (cpp, cs, d, go, java, jl, js, lua, php, pl, py, r, rb, rkt, rs, scala, sh, swift, ts) * - **livecodebench** - - Live coding contests from LeetCode, AtCoder, CodeForces - - Contamination-free coding evaluation + - Live coding evaluation - [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/livecodebench) + - LiveCodeBench (v1-v6, 0724_0125, 0824_0225) * - **scicode** - - Scientific research code generation - - Scientific computing and research + - Coding for scientific research - [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/scicode) + - SciCode ``` **Example Usage:** + +Create `config.yml`: + +```yaml +defaults: + - execution: local + - deployment: none + - _self_ + +evaluation: + tasks: + - name: humaneval_instruct + - name: mbpp +``` + +Run evaluation: + +```bash -# Run code generation evaluation +export NGC_API_KEY=nvapi-... + nemo-evaluator-launcher run \ - --config-dir packages/nemo-evaluator-launcher/examples \ - --config-name local_llama_3_1_8b_instruct \ - -o 'evaluation.tasks=["humaneval", "mbpp"]' + --config-dir .
\ + --config-name config.yml \ + -o execution.output_dir=results \ + -o +target.api_endpoint.model_id=meta/llama-3.1-8b-instruct \ + -o +target.api_endpoint.url=https://integrate.api.nvidia.com/v1/chat/completions \ + -o +target.api_endpoint.api_key_name=NGC_API_KEY ``` ### **Safety and Security** ```{list-table} :header-rows: 1 -:widths: 25 35 25 15 +:widths: 20 30 30 50 * - Container - - Benchmarks - Description - NGC Catalog + - Benchmarks +* - **garak** + - Safety and vulnerability testing + - [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/garak) + - Garak * - **safety-harness** - - Toxicity, bias, alignment tests - Safety and bias evaluation - [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/safety-harness) -* - **garak** - - Prompt injection, jailbreaking - - Security vulnerability scanning - - [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/garak) + - Aegis v2, BBQ, WildGuard ``` **Example Usage:** + +Create `config.yml`: + +```yaml +defaults: + - execution: local + - deployment: none + - _self_ + +evaluation: + tasks: + - name: aegis_v2 + - name: garak +``` + +Run evaluation: + ```bash -# Run comprehensive safety evaluation +export NGC_API_KEY=nvapi-... +export HF_TOKEN=hf_... + nemo-evaluator-launcher run \ - --config-dir packages/nemo-evaluator-launcher/examples \ - --config-name local_llama_3_1_8b_instruct \ - -o 'evaluation.tasks=["aegis_v2", "garak"]' + --config-dir . \ + --config-name config.yml \ + -o execution.output_dir=results \ + -o +target.api_endpoint.model_id=meta/llama-3.1-8b-instruct \ + -o +target.api_endpoint.url=https://integrate.api.nvidia.com/v1/chat/completions \ + -o +target.api_endpoint.api_key_name=NGC_API_KEY ``` -### **Function Calling and Agentic AI** +### **Function Calling** ```{list-table} :header-rows: 1 -:widths: 25 30 30 15 +:widths: 20 30 30 50 * - Container - - Benchmarks - Description - NGC Catalog + - Benchmarks * - **bfcl** - - Berkeley Function Calling Leaderboard - - Function calling evaluation + - Function calling - [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/bfcl) -* - **agentic_eval** - - Tool usage, planning tasks - - Agentic AI evaluation - - [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/agentic_eval) + - BFCL v2 and v3 * - **tooltalk** - - Tool interaction evaluation - - Tool usage assessment - - [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/tooltalk) + - Tool usage evaluation + - [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/tooltalk) + - ToolTalk +``` + +:::{note} +Some of the tasks in this category require function calling capabilities. See {ref}`deployment-testing-compatibility` for checking if your endpoint is compatible. +::: + +**Example Usage:** + +Create `config.yml`: + +```yaml +defaults: + - execution: local + - deployment: none + - _self_ + +evaluation: + tasks: + - name: bfclv2_ast_prompting + - name: tooltalk ``` +Run evaluation: + +```bash +export NGC_API_KEY=nvapi-... + +nemo-evaluator-launcher run \ + --config-dir . 
\ + --config-name config.yml \ + -o execution.output_dir=results \ + -o +target.api_endpoint.model_id=meta/llama-3.1-8b-instruct \ + -o +target.api_endpoint.url=https://integrate.api.nvidia.com/v1/chat/completions \ + -o +target.api_endpoint.api_key_name=NGC_API_KEY +``` + + ### **Vision-Language Models** ```{list-table} :header-rows: 1 -:widths: 25 35 25 15 +:widths: 20 30 30 50 * - Container - - Benchmarks - Description - NGC Catalog + - Benchmarks * - **vlmevalkit** - - VQA, image captioning, visual reasoning - Vision-language model evaluation - [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/vlmevalkit) + - AI2D, ChartQA, OCRBench, SlideVQA ``` -### **Retrieval and RAG** +:::{note} +The tasks in this category require a VLM chat endpoint. See {ref}`deployment-testing-compatibility` for checking if your endpoint is compatible. +::: -```{list-table} -:header-rows: 1 -:widths: 25 35 25 15 +**Example Usage:** -* - Container - - Benchmarks - - Description - - NGC Catalog -* - **rag_retriever_eval** - - Document retrieval, context relevance - - RAG system evaluation - - [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/rag_retriever_eval) +Create `config.yml`: + +```yaml +defaults: + - execution: local + - deployment: none + - _self_ + +evaluation: + tasks: + - name: ocrbench + - name: chartqa +``` + +Run evaluation: + +```bash +export NGC_API_KEY=nvapi-... + +nemo-evaluator-launcher run \ + --config-dir . \ + --config-name config.yml \ + -o execution.output_dir=results \ + -o +target.api_endpoint.model_id=meta/llama-3.1-8b-instruct \ + -o +target.api_endpoint.url=https://integrate.api.nvidia.com/v1/chat/completions \ + -o +target.api_endpoint.api_key_name=NGC_API_KEY ``` ### **Domain-Specific** ```{list-table} :header-rows: 1 -:widths: 25 35 25 15 +:widths: 20 30 30 50 * - Container - - Benchmarks - Description - NGC Catalog + - Benchmarks * - **helm** - - Medical AI evaluation (MedHELM) - - Healthcare-specific benchmarking + - Holistic evaluation framework - [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/helm) + - MedHelm +``` + +**Example Usage:** + +Create `config.yml`: + +```yaml +defaults: + - execution: local + - deployment: none + - _self_ + +evaluation: + tasks: + - name: pubmed_qa + - name: medcalc_bench +``` + +Run evaluation: + +```bash +export NGC_API_KEY=nvapi-... + +nemo-evaluator-launcher run \ + --config-dir . \ + --config-name config.yml \ + -o execution.output_dir=results \ + -o +target.api_endpoint.model_id=meta/llama-3.1-8b-instruct \ + -o +target.api_endpoint.url=https://integrate.api.nvidia.com/v1/chat/completions \ + -o +target.api_endpoint.api_key_name=NGC_API_KEY ``` ## Container Details @@ -284,7 +416,7 @@ NeMo Evaluator provides multiple integration options to fit your workflow: ```bash # Launcher CLI (recommended for most users) nemo-evaluator-launcher ls tasks -nemo-evaluator-launcher run --config-dir packages/nemo-evaluator-launcher/examples --config-name local_mmlu_evaluation +nemo-evaluator-launcher run --config-dir . --config-name local_mmlu_evaluation.yaml # Container direct execution docker run --rm nvcr.io/nvidia/eval-factory/simple-evals:{{ docker_compose_latest }} nemo-evaluator ls @@ -295,17 +427,6 @@ docker run --rm nvcr.io/nvidia/eval-factory/simple-evals:{{ docker_compose_lates ## Benchmark Selection Best Practices -### For Academic Publications - -**Recommended Core Suite**: -1. **MMLU Pro** or **MMLU** - Broad knowledge assessment -2. 
**GSM8K** - Mathematical reasoning -3. **ARC Challenge** - Scientific reasoning -4. **HellaSwag** - Commonsense reasoning -5. **TruthfulQA** - Factual accuracy - -This suite provides comprehensive coverage across major evaluation dimensions. - ### For Model Development **Iterative Testing**: @@ -333,14 +454,17 @@ params = ConfigParams( ### For Specialized Domains - **Code Models**: Focus on `humaneval`, `mbpp`, `livecodebench` -- **Instruction Models**: Emphasize `ifeval`, `mtbench`, `gpqa_diamond` +- **Instruction Models**: Emphasize `ifbench`, `mtbench` - **Multilingual Models**: Include `arc_multilingual`, `hellaswag_multilingual`, `mgsm` - **Safety-Critical**: Prioritize `safety-harness` and `garak` evaluations +(benchmarks-full-list)= +## Full Benchmarks List + +```{include} ../_resources/tasks-table.md +``` + ## Next Steps -- **Quick Start**: See {ref}`evaluation-overview` for the fastest path to your first evaluation -- **Task-Specific Guides**: Explore {ref}`eval-run` for detailed evaluation workflows -- **Configuration**: Review {ref}`eval-parameters` for optimizing evaluation settings - **Container Details**: Browse {ref}`nemo-evaluator-containers` for complete specifications - **Custom Benchmarks**: Learn {ref}`framework-definition-file` for custom evaluations diff --git a/docs/evaluation/custom-tasks.md b/docs/evaluation/custom-tasks.md index 55604483..41eed787 100644 --- a/docs/evaluation/custom-tasks.md +++ b/docs/evaluation/custom-tasks.md @@ -1,3 +1,7 @@ +--- +orphan: true +--- + (eval-custom-tasks)= (custom-tasks)= diff --git a/docs/evaluation/index.md b/docs/evaluation/index.md index ffefd940..82fab028 100644 --- a/docs/evaluation/index.md +++ b/docs/evaluation/index.md @@ -1,7 +1,3 @@ ---- -orphan: true ---- - (evaluation-overview)= # About Evaluation @@ -14,8 +10,8 @@ Before you run evaluations, ensure you have: 1. **Chosen your approach**: See {ref}`get-started-overview` for installation and setup guidance 2. **Deployed your model**: See {ref}`deployment-overview` for deployment options -3. **OpenAI-compatible endpoint**: Your model must expose a compatible API -4. **API credentials**: Access tokens for your model endpoint +3. **OpenAI-compatible endpoint**: Your model must expose a compatible API (see {ref}`deployment-testing-compatibility`). +4. **API credentials**: Access tokens for your model endpoint and Hugging Face Hub. 
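+
+   For example, you can export the credentials as environment variables before launching an evaluation. This is a minimal sketch: the variable names match the run examples used throughout these docs, and the values shown are placeholders for your own keys.
+
+   ```bash
+   # API key for the model endpoint (NVIDIA Build in the examples below)
+   export NGC_API_KEY=nvapi-...
+   # Hugging Face token for gated benchmark datasets
+   export HF_TOKEN=hf_...
+   ```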
--- @@ -33,9 +29,11 @@ Before you run evaluations, ensure you have: **Step 2: Select Benchmarks** Common academic suites: -- **Language Understanding**: `mmlu_pro`, `arc_challenge`, `hellaswag`, `truthfulqa` -- **Mathematical Reasoning**: `gsm8k`, `math` -- **Instruction Following**: `ifeval`, `gpqa_diamond` +- **General Knowledge**: `mmlu_pro`, `gpqa_diamond` +- **Mathematical Reasoning**: `AIME_2025`, `mgsm` +- **Instruction Following**: `ifbench`, `mtbench` + + Discover all available tasks: ```bash @@ -44,51 +42,38 @@ nemo-evaluator-launcher ls tasks **Step 3: Run Evaluation** -Using Launcher CLI: -```bash -nemo-evaluator-launcher run \ - --config-dir packages/nemo-evaluator-launcher/examples \ - --config-name local_llama_3_1_8b_instruct \ - -o 'evaluation.tasks=["mmlu_pro", "gsm8k", "arc_challenge"]' \ - -o target.api_endpoint.url=https://integrate.api.nvidia.com/v1/chat/completions \ - -o target.api_endpoint.api_key=${YOUR_API_KEY} +Create `config.yml`: + +```yaml +defaults: + - execution: local + - deployment: none + - _self_ + +evaluation: + tasks: + - name: mmlu_pro + - name: ifbench ``` -Using Python API: -```python -from nemo_evaluator.core.evaluate import evaluate -from nemo_evaluator.api.api_dataclasses import ( - EvaluationConfig, EvaluationTarget, ApiEndpoint, ConfigParams, EndpointType -) - -# Configure and run -eval_config = EvaluationConfig( - type="mmlu_pro", - output_dir="./results", - params=ConfigParams( - limit_samples=100, # Start with subset - temperature=0.01, # Near-deterministic - max_new_tokens=512, - parallelism=4 - ) -) - -target_config = EvaluationTarget( - api_endpoint=ApiEndpoint( - url="https://integrate.api.nvidia.com/v1/chat/completions", - model_id="meta/llama-3.1-8b-instruct", - type=EndpointType.CHAT, - api_key="YOUR_API_KEY" - ) -) - -result = evaluate(eval_cfg=eval_config, target_cfg=target_config) +Launch the job: + +```bash +export NGC_API_KEY=nvapi-... + +nemo-evaluator-launcher run \ + --config-dir . \ + --config-name config.yml \ + -o execution.output_dir=results \ + -o +target.api_endpoint.model_id=meta/llama-3.1-8b-instruct \ + -o +target.api_endpoint.url=https://integrate.api.nvidia.com/v1/chat/completions \ + -o +target.api_endpoint.api_key_name=NGC_API_KEY ``` -**Next Steps**: + ::: --- @@ -100,12 +85,6 @@ Select a workflow based on your environment and desired level of control. ::::{grid} 1 2 2 2 :gutter: 1 1 1 2 -:::{grid-item-card} {octicon}`play;1.5em;sd-mr-1` Run Evaluations -:link: run-evals/index -:link-type: doc -Step-by-step guides for different evaluation scenarios using launcher, core API, and container workflows. -::: - :::{grid-item-card} {octicon}`terminal;1.5em;sd-mr-1` Launcher Workflows :link: ../get-started/quickstart/launcher :link-type: doc @@ -133,17 +112,17 @@ Configure your evaluations, create custom tasks, explore benchmarks, and extend ::::{grid} 1 2 2 2 :gutter: 1 1 1 2 -:::{grid-item-card} {octicon}`gear;1.5em;sd-mr-1` Configuration Parameters + -:::{grid-item-card} {octicon}`tools;1.5em;sd-mr-1` Custom Task Configuration + :::{grid-item-card} {octicon}`list-unordered;1.5em;sd-mr-1` Benchmark Catalog :link: eval-benchmarks @@ -184,12 +163,6 @@ Export evaluation results to MLflow, Weights & Biases, Google Sheets, and other Configure request/response processing, logging, caching, and custom interceptors. 
::: -:::{grid-item-card} {octicon}`alert;1.5em;sd-mr-1` Troubleshooting -:link: ../troubleshooting/index -:link-type: doc -Resolve common evaluation issues, debug configuration problems, and optimize evaluation performance. -::: - :::: ## Core Evaluation Concepts diff --git a/docs/evaluation/parameters.md b/docs/evaluation/parameters.md index 7248e1a9..84c68950 100644 --- a/docs/evaluation/parameters.md +++ b/docs/evaluation/parameters.md @@ -1,3 +1,6 @@ +--- +orphan: true +--- (eval-parameters)= # Evaluation Configuration Parameters diff --git a/docs/evaluation/run-evals/index.md b/docs/evaluation/run-evals/index.md index 668aa4e6..7459cb97 100644 --- a/docs/evaluation/run-evals/index.md +++ b/docs/evaluation/run-evals/index.md @@ -1,3 +1,6 @@ +--- +orphan: true +--- (eval-run)= # Run Evaluations diff --git a/docs/get-started/quickstart/core.md b/docs/get-started/quickstart/core.md index 2474c5f2..c75f6e1b 100644 --- a/docs/get-started/quickstart/core.md +++ b/docs/get-started/quickstart/core.md @@ -9,7 +9,7 @@ The NeMo Evaluator Core provides direct Python API access for custom configurati - Python environment - OpenAI-compatible endpoint (hosted or self-deployed) and an API key (if the endpoint is gated) -- Verify endpoint compatibility using our [Testing Endpoint Compatibility](../../deployment/bring-your-own-endpoint/testing-endpoint-oai-compatibility.md) guide +- Verify endpoint compatibility using our {ref}`deployment-testing-compatibility` guide ## Quick Start diff --git a/docs/get-started/quickstart/index.md b/docs/get-started/quickstart/index.md index 067dd393..a0ade84e 100644 --- a/docs/get-started/quickstart/index.md +++ b/docs/get-started/quickstart/index.md @@ -68,7 +68,7 @@ NeMo Evaluator works with any OpenAI-compatible endpoint. You have several optio ### **Self-Hosted Options** -If you prefer to host your own models, verify OpenAI compatibility using our [Testing Endpoint Compatibility](../../deployment/bring-your-own-endpoint/testing-endpoint-oai-compatibility.md) guide. +If you prefer to host your own models, verify OpenAI compatibility using our {ref}`deployment-testing-compatibility` guide. If you are deploying the model locally with Docker, you can use a dedicated docker network. This will provide a secure connetion between deployment and evaluation docker containers. diff --git a/docs/index.md b/docs/index.md index 142a70b9..1afba1bd 100644 --- a/docs/index.md +++ b/docs/index.md @@ -354,15 +354,19 @@ Quickstart About Tutorials ::: --> - +:::{toctree} :caption: Evaluation :hidden: About Model Evaluation -Run Evals -Custom Task Configuration Benchmark Catalog -::: --> +:::