165 commits
1326d00
beep boop 🤖: Bumping nemo_evaluator_launcher to v0.1.4
ko3n1g Sep 24, 2025
3a846a9
beep boop 🤖: Bumping nemo_evaluator to v0.1.4
ko3n1g Sep 24, 2025
098f3fb
chore(ci/release): enable cron run (#216)
agronskiy Sep 24, 2025
da80190
(feat) Configure request method for progress tracking requests (#213)
wprazuch Sep 24, 2025
f3e36e7
beep boop 🤖: Bumping nemo_evaluator_launcher to v0.1.5
ko3n1g Sep 25, 2025
7f77d93
beep boop 🤖: Bumping nemo_evaluator to v0.1.5
ko3n1g Sep 25, 2025
6ae471c
Awarno/reasoning tokens (#211)
AWarno Sep 25, 2025
6e81164
Update overview.md (#210)
AWarno Sep 25, 2025
75aeeb6
feat(multi-instance): haproxy
AWarno Sep 26, 2025
86a557d
fix(health-url): health url fixed
AWarno Sep 26, 2025
0681135
fix(conflict): fix conflict
AWarno Sep 29, 2025
710ca95
beep boop 🤖: Bumping nemo_evaluator to v0.1.4
ko3n1g Sep 24, 2025
fc772a2
beep boop 🤖: Bumping nemo_evaluator_launcher to v0.1.5
ko3n1g Sep 25, 2025
a5bee56
beep boop 🤖: Bumping nemo_evaluator to v0.1.5
ko3n1g Sep 25, 2025
4884615
fix(executors): migrate to eval-factory cmd (#229)
piojanu Sep 25, 2025
d6cf567
(chore) Update container versions (#230)
wprazuch Sep 25, 2025
5bf95ad
Llane/home pg style edits (#215)
lbliii Sep 26, 2025
de17a9e
(chore) Switch to referring to latest in the docs (#231)
wprazuch Sep 26, 2025
693392c
post qa docs update (#225)
AWarno Sep 26, 2025
6aa2f8c
(chore) Revert switch to latest (#232)
wprazuch Sep 26, 2025
4375a60
beep boop 🤖: Bumping nemo_evaluator to v0.1.6
ko3n1g Sep 29, 2025
a1fb564
beep boop 🤖: Bumping nemo_evaluator_launcher to v0.1.6
ko3n1g Sep 29, 2025
c78bb73
ci(fix): Dependabot (#236)
ko3n1g Sep 29, 2025
1452fce
Add changes from PR 215 to the README (#239)
marta-sd Sep 29, 2025
3e6c9d2
Merge branch 'main' into awarno/haproxy
AWarno Sep 29, 2025
06e6a85
Merge branch 'main' into awarno/haproxy
AWarno Oct 1, 2025
aa87ded
Merge branch 'main' into awarno/haproxy
AWarno Oct 7, 2025
2d9d1dc
feat(add-missing=-files): add missing files
AWarno Oct 15, 2025
19cc939
Update haproxy.cfg.template
AWarno Oct 20, 2025
4f113b7
Merge branch 'main' into awarno/haproxy
AWarno Oct 22, 2025
df2cbfd
fix(increase-timeout): increase timeout
AWarno Oct 22, 2025
b0afe69
Merge branch 'awarno/haproxy' of https://github.com/NVIDIA-NeMo/Evalu…
AWarno Oct 22, 2025
b14f3d7
beep boop 🤖: Bumping to v0.1.3
ko3n1g Sep 24, 2025
6eaedaa
beep boop 🤖: Bumping to v0.1.3
ko3n1g Sep 24, 2025
6fccce8
beep boop 🤖: Bumping nemo_evaluator_launcher to v0.1.4
ko3n1g Sep 24, 2025
7d730f0
beep boop 🤖: Bumping nemo_evaluator to v0.1.4
ko3n1g Sep 24, 2025
b38136f
chore(ci/release): enable cron run (#216)
agronskiy Sep 24, 2025
cc97901
(feat) Configure request method for progress tracking requests (#213)
wprazuch Sep 24, 2025
0f939b6
beep boop 🤖: Bumping nemo_evaluator_launcher to v0.1.5
ko3n1g Sep 25, 2025
07fa111
beep boop 🤖: Bumping nemo_evaluator to v0.1.5
ko3n1g Sep 25, 2025
54d693e
Awarno/reasoning tokens (#211)
AWarno Sep 25, 2025
e3c08a9
Update overview.md (#210)
AWarno Sep 25, 2025
256a380
feat(multi-instance): haproxy
AWarno Sep 26, 2025
1bf4309
fix(health-url): health url fixed
AWarno Sep 26, 2025
21cde50
fix(conflict): fix conflict
AWarno Sep 29, 2025
55bf1fa
beep boop 🤖: Bumping nemo_evaluator to v0.1.4
ko3n1g Sep 24, 2025
be9e04e
beep boop 🤖: Bumping nemo_evaluator_launcher to v0.1.5
ko3n1g Sep 25, 2025
5f4b88b
beep boop 🤖: Bumping nemo_evaluator to v0.1.5
ko3n1g Sep 25, 2025
95c47ac
fix(executors): migrate to eval-factory cmd (#229)
piojanu Sep 25, 2025
697e4b4
(chore) Update container versions (#230)
wprazuch Sep 25, 2025
ce49b0b
Llane/home pg style edits (#215)
lbliii Sep 26, 2025
1b343e7
(chore) Switch to referring to latest in the docs (#231)
wprazuch Sep 26, 2025
17fc713
post qa docs update (#225)
AWarno Sep 26, 2025
f95149d
(chore) Revert switch to latest (#232)
wprazuch Sep 26, 2025
f9037ba
beep boop 🤖: Bumping nemo_evaluator to v0.1.6
ko3n1g Sep 29, 2025
60e9ef1
beep boop 🤖: Bumping nemo_evaluator_launcher to v0.1.6
ko3n1g Sep 29, 2025
a11208b
ci(fix): Dependabot (#236)
ko3n1g Sep 29, 2025
e7d068a
Add changes from PR 215 to the README (#239)
marta-sd Sep 29, 2025
e7ae57c
(chore) Switch to referring to latest in the docs (#231)
wprazuch Sep 26, 2025
52168bf
(chore) Revert switch to latest (#232)
wprazuch Sep 26, 2025
f8d146b
chore: Update cherry-pick workflow to use v0.63.0 (#233)
pablo-garay Sep 29, 2025
75af0b2
beep boop 🤖: Bumping nemo_evaluator_launcher to v0.1.7
ko3n1g Sep 30, 2025
8d8f13f
beep boop 🤖: Bumping nemo_evaluator to v0.1.7
ko3n1g Sep 30, 2025
0162518
fix(nemo-evaluator-examaple): add missing package-data to pyproject.t…
marta-sd Sep 30, 2025
bb798e2
beep boop 🤖: Bumping nemo_evaluator to v0.1.8
ko3n1g Oct 1, 2025
ba7f0eb
fix(docs): fix invalid reference for gsm8k, replace it with gpqa_diam…
fgalko-oss Oct 1, 2025
cc00132
fix(docs): use api_key_name instead of api_key everywhere (#214)
fgalko-oss Oct 1, 2025
fe15c7a
bug(docs): fix self-service docs (#248)
marta-sd Oct 1, 2025
9a59438
beep boop 🤖: Bumping nemo_evaluator to v0.1.9
ko3n1g Oct 2, 2025
3e70aff
(chore) Update NeMo FW docs and tutorials (#208)
marta-sd Oct 2, 2025
6afd924
feat: upd to 1.0.8
agronskiy Oct 2, 2025
be5c674
beep boop 🤖: Bumping nemo_evaluator_launcher to v0.1.9
ko3n1g Oct 2, 2025
b97807a
ci(fix): Validation on skipped jobs (#254)
ko3n1g Oct 2, 2025
8a194e6
fix(local-exec): avoid creation of exec id for dry-run (#260)
agronskiy Oct 3, 2025
547ff26
Awarno/generic server (#228)
AWarno Oct 3, 2025
6d26406
feat(states-formatting): enhance state labels with colors and icons (…
AWarno Oct 3, 2025
3d1d414
Update local-evaluation-of-existing-endpoint.md (#258)
AWarno Oct 3, 2025
3efec06
feat(config-output-dir): add config output dir (#259)
AWarno Oct 3, 2025
c80c683
Awarno/verbose cli (#251)
AWarno Oct 3, 2025
a95cf54
Awarno/fix adapter server readiness issue (#263)
AWarno Oct 3, 2025
3050f5c
beep boop 🤖: Bumping nemo_evaluator to v0.1.10
ko3n1g Oct 6, 2025
792c9a4
beep boop 🤖: Bumping nemo_evaluator_launcher to v0.1.10
ko3n1g Oct 6, 2025
b19236e
fix(cli): change entrypoint name from eval-factory to nemo-evaluator …
prokotg Oct 6, 2025
29c303b
(chore) Update the list of containers for Release 25.09 (#274)
wprazuch Oct 6, 2025
cac737e
fix(launcher) Set `--network=host` for the local executor and fix eva…
marta-sd Oct 6, 2025
35be91f
feat: add trtllm deployment config (#278)
piojanu Oct 6, 2025
fef0f7d
beep boop 🤖: Bumping nemo_evaluator to v0.1.11
ko3n1g Oct 7, 2025
917b5a2
beep boop 🤖: Bumping nemo_evaluator_launcher to v0.1.11
ko3n1g Oct 7, 2025
d8b0bba
chore(docs) include NeMo FW in the README and populate index.md (#256)
marta-sd Oct 7, 2025
f82189d
feat(add-missing=-files): add missing files
AWarno Oct 15, 2025
902753c
fix(increase-timeout): increase timeout
AWarno Oct 22, 2025
36ea41c
beep boop 🤖: Bumping nemo_evaluator to v0.1.12
ko3n1g Oct 8, 2025
816c7cb
beep boop 🤖: Bumping nemo_evaluator_launcher to v0.1.12
ko3n1g Oct 8, 2025
741659d
style(unify-api-key-name): unify nvidia api key name (#276)
AWarno Oct 8, 2025
e59430d
(fix) remove arena-hard task (#287)
wprazuch Oct 8, 2025
c8cbc93
chore: sanitize config examples/docs (#288)
ka00ri Oct 8, 2025
8dbabdb
fix(local executor): use extra_docker_args instead of hard-coded --ne…
marta-sd Oct 8, 2025
5486d1c
beep boop 🤖: Bumping nemo_evaluator_launcher to v0.1.13
ko3n1g Oct 9, 2025
d10dc1f
feat(exporters): add config overrides, fix auto-export, fix logged c…
ka00ri Oct 9, 2025
290fb81
Awarno/job id status (#285)
AWarno Oct 9, 2025
f48c47d
build(config-test): dry run yaml examples test (#292)
AWarno Oct 10, 2025
c1ea50e
beep boop 🤖: Bumping nemo_evaluator_launcher to v0.1.14
ko3n1g Oct 13, 2025
b8f8e75
beep boop 🤖: Bumping nemo_evaluator to v0.1.13
ko3n1g Oct 13, 2025
fae0670
feat(interceptors): remove params from payload recursively (#300)
piojanu Oct 13, 2025
e3222f6
beep boop 🤖: Bumping nemo_evaluator to v0.1.14
ko3n1g Oct 14, 2025
3a0fe72
beep boop 🤖: Bumping nemo_evaluator_launcher to v0.1.15
ko3n1g Oct 14, 2025
306a6ac
fix(exporters): fix local export of remote job; optimize ssh connecti…
ka00ri Oct 14, 2025
efb3fb5
chore(exporters): local run + automated local export example to MLflo…
ka00ri Oct 14, 2025
0ba004a
Parameterize gpu utilization in vllm (#312)
AdamRajfer Oct 14, 2025
37dbe6f
fix(tutorial): use full config dir path (#315)
marta-sd Oct 14, 2025
b6cb857
fix(README): fix typo in 'Supported Benchmarks' link (#314)
marta-sd Oct 14, 2025
51ecd5f
fix(local-kill): fix local kill (#303)
AWarno Oct 14, 2025
c1de28b
beep boop 🤖: Bumping nemo_evaluator_launcher to v0.1.16
ko3n1g Oct 15, 2025
2fe7454
feat: Import-checker (#307)
pablo-garay Oct 15, 2025
e9deb07
chore(cli): add support for debugging helper functionalities and simp…
ka00ri Oct 15, 2025
489b7d6
feat(ls-runs): improvements to get the launched jobs in timespan + d…
agronskiy Oct 15, 2025
fdba0c1
Feat: Add: Import checker test summary (#327)
pablo-garay Oct 15, 2025
d6255ea
feat: Add semantic pull request check (#330)
pablo-garay Oct 15, 2025
ffca192
beep boop 🤖: Bumping nemo_evaluator_launcher to v0.1.17
ko3n1g Oct 16, 2025
c975188
beep boop 🤖: Bumping nemo_evaluator to v0.1.15
ko3n1g Oct 16, 2025
adeeb2e
feat(adapters): dynamic port adapter server (#324)
prokotg Oct 16, 2025
c27e63b
fix(executors): make dry-run no-op and improve validation (#309)
ka00ri Oct 16, 2025
e41dd18
Solve docs issue (#329)
pablo-garay Oct 16, 2025
8d38781
chore(README): remove the graph (#336)
marta-sd Oct 16, 2025
bf3f1df
docs(all): Docs Infra + Info Architecture (#265)
lbliii Oct 17, 2025
69dcc41
fix: Optional build-docs (#345)
pablo-garay Oct 17, 2025
4137052
beep boop 🤖: Bumping nemo_evaluator_launcher to v0.1.18
ko3n1g Oct 20, 2025
a7017a1
chore(ux): dry-run show full config and colored structure (#347)
agronskiy Oct 20, 2025
d834709
fix(config): type slurm (#349)
agronskiy Oct 20, 2025
39306e2
chore: forced release of 0.1.16 after flake
agronskiy Oct 20, 2025
5a8c655
beep boop 🤖: Bumping nemo_evaluator to v0.1.17
ko3n1g Oct 20, 2025
cc4e832
chore(docs): improve examples and docs for slurm executor (#351)
ka00ri Oct 20, 2025
cf4de0f
chore(cli): rename debug to info (#342)
ka00ri Oct 20, 2025
63bfd44
docs(output): first wireframe of concepts (#350)
agronskiy Oct 20, 2025
125c085
docs(deployment): use full example for locally-deployed model and lau…
marta-sd Oct 20, 2025
0b03950
beep boop 🤖: Bumping nemo_evaluator to v0.1.18
ko3n1g Oct 21, 2025
3a8a833
beep boop 🤖: Bumping nemo_evaluator_launcher to v0.1.19
ko3n1g Oct 21, 2025
2c89aa3
ci: Extend for integration tests (#255)
ko3n1g Oct 21, 2025
af0c206
docs(slurm): add back extended docs on env vars and mounts (#361)
marta-sd Oct 21, 2025
1393729
ci(fix): Restore pre-flight (#363)
ko3n1g Oct 21, 2025
a7acb4c
feat: suppress pydantic warnings from third-party libraries (#365)
piojanu Oct 21, 2025
80e4693
chore(cods): add back hand-crafted api ref for nemo-evaluator (#362)
marta-sd Oct 21, 2025
7e80061
style: config -> nemo_evaluator_config (#354)
AWarno Oct 21, 2025
d76c0d1
docs: extension update to filter search results for orphaned sections…
lbliii Oct 21, 2025
f28261f
docs: replace overrides with nemo_evaluator_config in docs (#364)
marta-sd Oct 21, 2025
6f6b449
docs(output): add sample exported results to the output docs (#366)
marta-sd Oct 21, 2025
6c7d5be
fix(CI): failure on re-upload of the artifacts due to existing name (…
wprazuch Oct 21, 2025
eafd08a
fix: don't enable progress tracking because output_dir is set (#367)
piojanu Oct 21, 2025
6aab9c2
docs(readme): clean up and fix broken links (#369)
sephmard Oct 22, 2025
f5f8851
feat: add labeler (#359)
pablo-garay Oct 22, 2025
7650146
beep boop 🤖: Bumping nemo_evaluator_launcher to v0.1.20
ko3n1g Oct 22, 2025
fa7b285
beep boop 🤖: Bumping nemo_evaluator to v0.1.19
ko3n1g Oct 22, 2025
696b7f2
beep boop 🤖: Bumping nemo_evaluator_launcher to v0.1.21
ko3n1g Oct 23, 2025
815d495
feat(unused-template): remove unused template
AWarno Oct 23, 2025
e2fb5ce
Merge branch 'main' into awarno/haproxy
AWarno Oct 23, 2025
e74f14f
fix(slurm-tests): fix slurm tests
AWarno Oct 23, 2025
a1df099
Merge branch 'awarno/haproxy' of https://github.com/NVIDIA-NeMo/Evalu…
AWarno Oct 23, 2025
ad31a42
Merge branch 'awarno/haproxy' of https://github.com/NVIDIA-NeMo/Evalu…
AWarno Oct 26, 2025
291601e
fix(conflicts): fix conflicts
AWarno Oct 27, 2025
ddb4e1e
fix(fix-conflicts): fix conflicts
AWarno Oct 27, 2025
10d1c02
feat(mater-ip): add master ip
AWarno Oct 27, 2025
f74f6c4
feat(mater-ip): add master ip lint
AWarno Oct 27, 2025
3674487
fix(multinode-deployment-health): fix multinode deployment health
AWarno Oct 27, 2025
f3f02d7
fix(missing-template): fix missing template
AWarno Oct 28, 2025
f2fa8e6
Merge branch 'main' into awarno/haproxy
fgalko-oss Oct 29, 2025
@@ -236,7 +236,15 @@ def apply_url_override(url: str) -> str:
     else:
         # Local executor - use localhost
         endpoint_uri = cfg.deployment.endpoints[endpoint_type]
-        endpoint_url = f"http://127.0.0.1:{cfg.deployment.port}{endpoint_uri}"
+
+        # Use HAProxy port if multiple_instances is enabled
+        if cfg.deployment.get("multiple_instances", False):
+            proxy_config = cfg.execution.get("proxy", {}).get("config", {})
+            port = proxy_config.get("haproxy_port", 5009)
+        else:
+            port = cfg.deployment.port
+
+        endpoint_url = f"http://127.0.0.1:{port}{endpoint_uri}"
     return endpoint_url
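The port selection in this hunk can be sketched in isolation. The following is a minimal illustration using plain dictionaries instead of the OmegaConf `cfg` object; the function name `resolve_local_port` and the constant `DEFAULT_HAPROXY_PORT` are hypothetical and do not appear in the PR.

```python
from typing import Any, Mapping

# Default taken from the diff: used when execution.proxy.config.haproxy_port is unset.
DEFAULT_HAPROXY_PORT = 5009


def resolve_local_port(cfg: Mapping[str, Any]) -> int:
    """Return the port the evaluation client targets on 127.0.0.1.

    Hypothetical helper mirroring the branch in the hunk above:
    with multiple_instances enabled, traffic goes to the HAProxy port,
    otherwise straight to the deployment port.
    """
    deployment = cfg.get("deployment", {})
    if deployment.get("multiple_instances", False):
        proxy_config = cfg.get("execution", {}).get("proxy", {}).get("config", {})
        return proxy_config.get("haproxy_port", DEFAULT_HAPROXY_PORT)
    return deployment["port"]


single = {"deployment": {"port": 8000}}
multi = {
    "deployment": {"port": 8000, "multiple_instances": True},
    "execution": {"proxy": {"config": {"haproxy_port": 6000}}},
}
multi_default = {"deployment": {"port": 8000, "multiple_instances": True}}

print(resolve_local_port(single))
print(resolve_local_port(multi))
print(resolve_local_port(multi_default))
```

Under these assumptions the single-instance config resolves to the deployment port, while the multi-instance configs resolve to the configured or default HAProxy port.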


@@ -17,6 +17,7 @@ type: nim
 image: ??? # e.g., nvcr.io/nim/meta/llama-3.1-8b-instruct:1.8.6
 served_model_name: ???
 port: 8000
+multiple_instances: false # If true, deploy across multiple nodes with --nodes $num_nodes --ntasks $num_nodes

 # NIM containers use default entrypoint - no custom command needed
 # Configuration is done via environment variables in lepton_config
@@ -22,6 +22,7 @@ tensor_parallel_size: 8
 data_parallel_size: 1
 extra_args: ""
 env_vars: {} # {name: value} dict
+multiple_instances: false # If true, deploy across multiple nodes with --nodes $num_nodes --ntasks $num_nodes

 endpoints:
   chat: /v1/chat/completions
@@ -24,6 +24,7 @@ data_parallel_size: 1
 gpu_memory_utilization: 0.95
 extra_args: ""
 env_vars: {} # {name: value} dict
+multiple_instances: false # If true, deploy across multiple nodes with --nodes $num_nodes --ntasks $num_nodes

 endpoints:
   chat: /v1/chat/completions
@@ -32,3 +32,11 @@ mounts:
   deployment: {}
   evaluation: {}
   mount_home: true
+
+proxy:
+  type: haproxy
+  image: haproxy:latest
+  config:
+    haproxy_port: 5009
+    health_check_path: /health
+    health_check_status: 200
@@ -30,6 +30,7 @@
 from typing import Dict, List, Optional

 import yaml
+from jinja2 import Environment, FileSystemLoader
 from omegaconf import DictConfig, OmegaConf

 from nemo_evaluator_launcher.common.execdb import (
@@ -40,11 +41,8 @@
 )
 from nemo_evaluator_launcher.common.helpers import (
     get_api_key_name,
-    get_endpoint_url,
     get_eval_factory_command,
-    get_eval_factory_config,
     get_eval_factory_dataset_size_from_run_config,
-    get_health_url,
     get_timestamp_string,
 )
 from nemo_evaluator_launcher.common.mapping import (
@@ -123,6 +121,18 @@ def execute_eval(cfg: DictConfig, dry_run: bool = False) -> str:
                 invocation_id=invocation_id,
                 job_id=job_id,
             )
+
+            # Create HAProxy config file with placeholder IPs only if multiple_instances is true
+            if cfg.deployment.get("multiple_instances", False):
+                haproxy_config = _generate_haproxy_config_with_placeholders(cfg)
+                # Save both template and working config
+                haproxy_template_path = local_task_subdir / "haproxy.cfg.template"
+                haproxy_config_path = local_task_subdir / "haproxy.cfg"
+                with open(haproxy_template_path, "w") as f:
+                    f.write(haproxy_config)
+                with open(haproxy_config_path, "w") as f:
+                    f.write(haproxy_config)
+
             local_runsub_path = local_task_subdir / "run.sub"
             remote_runsub_path = remote_task_subdir / "run.sub"
             with open(local_runsub_path, "w") as f:
@@ -455,15 +465,6 @@ def _create_slurm_sbatch_script(
     tasks_mapping = load_tasks_mapping()
     task_definition = get_task_from_mapping(task.name, tasks_mapping)

-    # Create merged config for get_endpoint_url
-    merged_nemo_evaluator_config = get_eval_factory_config(cfg, task)
-    health_url = get_health_url(
-        cfg,
-        get_endpoint_url(
-            cfg, merged_nemo_evaluator_config, task_definition["endpoint_type"]
-        ),
-    )
-
     # TODO(public release): convert to template
     s = "#!/bin/bash\n"

@@ -560,36 +561,30 @@ def _create_slurm_sbatch_script(
             deployment_mounts_list.append(f"{source_mnt}:{target_mnt}")

         # add deployment srun command
-        s += "# deployment server\n"
-        s += "srun --mpi pmix --overlap "
-        s += "--container-image {} ".format(cfg.deployment.image)
-        if deployment_mounts_list:
-            s += "--container-mounts {} ".format(",".join(deployment_mounts_list))
-        if not cfg.execution.get("mounts", {}).get("mount_home", True):
-            s += "--no-container-mount-home "
-        s += "--output {} ".format(remote_task_subdir / "logs" / "server-%A.out")
-        deployment_env_var_names = list(
-            cfg.execution.get("env_vars", {}).get("deployment", {})
-        )
-        if cfg.deployment.get("env_vars"):
-            warnings.warn(
-                "cfg.deployment.env_vars will be deprecated in future versions. "
-                "Use cfg.execution.env_vars.deployment instead.",
-                category=DeprecationWarning,
-                stacklevel=2,
-            )
-            deployment_env_var_names.extend(list(cfg.deployment["env_vars"]))
-        if deployment_env_var_names:
-            s += f"--container-env {','.join(deployment_env_var_names)} "
-        s += "{} &\n\n".format(cfg.deployment.command)  # run asynchronously
-        s += (
-            "SERVER_PID=$! # capture the PID of the server background srun process\n\n"
+        s += _generate_deployment_srun_command(
+            cfg, deployment_mounts_list, remote_task_subdir
         )

         # wait for the server to initialize
-        s += _WAIT_FOR_SERVER_HANDLER.format(health_url=health_url)
+        health_path = cfg.deployment.get("health_check_path", "/health")
+        # For multi-instance check all node IPs, for single instance check localhost
+        if cfg.deployment.get("multiple_instances", False):
+            ip_list = '"${NODES_IPS_ARRAY[@]}"'
+        else:
+            ip_list = '"127.0.0.1"'
+        s += _get_wait_for_server_handler(
+            ip_list,
+            cfg.deployment.port,
+            health_path,
+            "server",
+            check_pid=True,
+        )
         s += "\n\n"

+        # add HAProxy load balancer only if multiple_instances is true
+        if cfg.deployment.get("multiple_instances", False):
+            s += _get_proxy_server_srun_command(cfg, remote_task_subdir)
+
     # prepare evaluation mounts
     evaluation_mounts_list = [
         "{}:/results".format(remote_task_subdir / "artifacts"),
@@ -615,6 +610,7 @@ def _create_slurm_sbatch_script(

     s += "# evaluation client\n"
     s += "srun --mpi pmix --overlap "
+    s += "--nodes 1 --ntasks 1 "  # Client always runs on single node
     s += "--container-image {} ".format(eval_image)
     evaluation_env_var_names = list(
         cfg.execution.get("env_vars", {}).get("evaluation", {})
@@ -632,7 +628,10 @@ def _create_slurm_sbatch_script(

     # terminate the server after all evaluation clients finish
     if cfg.deployment.type != "none":
-        s += "kill $SERVER_PID # terminate the server to finish gracefully\n\n"
+        s += "kill $SERVER_PID # terminate the server to finish gracefully\n"
+        if cfg.deployment.get("multiple_instances", False):
+            s += "kill $HAPROXY_PID # terminate HAProxy to finish gracefully\n"
+        s += "\n"

     # auto-export
     ae_cfg = cfg.execution.get("auto_export")
@@ -1094,9 +1093,180 @@ def _get_progress(
 """.strip()


-_WAIT_FOR_SERVER_HANDLER = """
-date
-# wait for the server to initialize
-bash -c 'while [[ "$(curl -s -o /dev/null -w "%{{http_code}}" {health_url})" != "200" ]]; do kill -0 '"$SERVER_PID"' 2>/dev/null || {{ echo "Server process '"$SERVER_PID"' died"; exit 1; }}; sleep 5; done'
+def _generate_haproxy_config_with_placeholders(cfg):
+    """Generate HAProxy configuration with placeholder IPs using Jinja template."""
+    # Set up Jinja environment
+    template_dir = Path(__file__).parent
+    template_path = template_dir / "haproxy.cfg.template"
+
+    if not template_path.exists():
+        raise FileNotFoundError(f"HAProxy template not found: {template_path}")
+
+    env = Environment(loader=FileSystemLoader(template_dir))
+    template = env.get_template("haproxy.cfg.template")
+
+    # Prepare template data with placeholder IPs - use actual number of nodes
+    num_nodes = cfg.execution.num_nodes
+    nodes = []
+    for i in range(num_nodes):
+        nodes.append({"ip": f"{{IP_{i}}}", "port": cfg.deployment.port})
+
+    # Get health check parameters from execution config
+    proxy_config = cfg.execution.get("proxy", {}).get("config", {})
+    health_check_path = proxy_config.get("health_check_path", "/health")
+    health_check_status = proxy_config.get("health_check_status", 200)
+    haproxy_port = proxy_config.get("haproxy_port", 5009)
+
+    # Render template
+    config = template.render(
+        haproxy_port=haproxy_port,
+        health_check_path=health_check_path,
+        health_check_status=health_check_status,
+        nodes=nodes,
+    )
+
+    return config
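The function above writes `{IP_0}`, `{IP_1}`, ... into the rendered config because node IPs are only known once the Slurm job runs; the generated sbatch script later swaps them in with a `sed` loop. As a rough Python stand-in for that two-step flow (the helper names `placeholder_nodes` and `fill_placeholders` are hypothetical, and the PR itself does the substitution in shell, not Python):

```python
from typing import Dict, List


def placeholder_nodes(num_nodes: int, port: int) -> List[Dict[str, object]]:
    """Build node entries with the same placeholder shape as the diff: {IP_0}, {IP_1}, ..."""
    return [{"ip": f"{{IP_{i}}}", "port": port} for i in range(num_nodes)]


def fill_placeholders(config_text: str, node_ips: List[str]) -> str:
    """Replace each {IP_i} with the i-th real IP (Python stand-in for the sed loop)."""
    for i, ip in enumerate(node_ips):
        # The closing brace keeps {IP_1} from matching inside {IP_10}.
        config_text = config_text.replace(f"{{IP_{i}}}", ip)
    return config_text


nodes = placeholder_nodes(2, 8000)
template_text = "\n".join(
    f"server node{n} {node['ip']}:{node['port']} check"
    for n, node in enumerate(nodes, start=1)
)
final_text = fill_placeholders(template_text, ["10.0.0.1", "10.0.0.2"])
print(template_text)
print(final_text)
```

Keeping the placeholder version on disk as `haproxy.cfg.template` means the substitution can be redone from scratch if the job restarts on different nodes, which matches the "important for restarts" comment in the generated script.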


+def _generate_haproxy_config(cfg, nodes_ips):
+    """Generate HAProxy configuration using Jinja template."""
+    # Set up Jinja environment
+    template_dir = Path(__file__).parent
+    template_path = template_dir / "haproxy.cfg.template"
+
+    if not template_path.exists():
+        raise FileNotFoundError(f"HAProxy template not found: {template_path}")
+
+    env = Environment(loader=FileSystemLoader(template_dir))
+    template = env.get_template("haproxy.cfg.template")
+
+    # Prepare template data; all nodes use the same port
+    nodes = [{"ip": ip, "port": cfg.deployment.port} for ip in nodes_ips]
+
+    # Get health check parameters from execution config (execution.proxy.config),
+    # consistent with _generate_haproxy_config_with_placeholders
+    proxy_config = cfg.execution.get("proxy", {}).get("config", {})
+    health_check_path = proxy_config.get("health_check_path", "/health")
+    health_check_status = proxy_config.get("health_check_status", 200)
+    haproxy_port = proxy_config.get("haproxy_port", 5009)
+
+    # Render template
+    config = template.render(
+        haproxy_port=haproxy_port,
+        health_check_path=health_check_path,
+        health_check_status=health_check_status,
+        nodes=nodes,
+    )
+
+    return config


+def _generate_deployment_srun_command(
+    cfg, deployment_mounts_list, remote_task_subdir, instance_id: int = 0
+):
+    """Generate the deployment srun command with proper node/ntask configuration."""
+    s = ""
+    s += "# deployment server\n"
+    s += "# Get node IPs\n"
+    s += "nodes=( $(scontrol show hostnames $SLURM_JOB_NODELIST) )\n"
+    s += 'nodes_array=("${nodes[@]}") # Ensure nodes are stored properly\n'
+    s += 'export NODES_IPS_ARRAY=($(for node in "${nodes_array[@]}"; do srun --nodelist=$node --ntasks=1 --nodes=1 hostname --ip-address; done))\n'
+    s += 'echo "Node IPs: ${NODES_IPS_ARRAY[@]}"\n'
+    s += "# Export MASTER_IP as the first node IP\n"
+    s += "export MASTER_IP=${NODES_IPS_ARRAY[0]}\n"
+    s += 'echo "MASTER_IP: $MASTER_IP"\n'
+    s += "srun --mpi pmix --overlap "
+    s += f"--nodes {cfg.execution.num_nodes} --ntasks {cfg.execution.num_nodes} "
+    s += "--container-image {} ".format(cfg.deployment.image)
+    if deployment_mounts_list:
+        s += "--container-mounts {} ".format(",".join(deployment_mounts_list))
+    if not cfg.execution.get("mounts", {}).get("mount_home", True):
+        s += "--no-container-mount-home "
+    s += "--output {} ".format(remote_task_subdir / "logs" / "server-%A-%t.out")
+
+    deployment_env_var_names = list(
+        cfg.execution.get("env_vars", {}).get("deployment", {})
+    )
+    if cfg.deployment.get("env_vars"):
+        warnings.warn(
+            "cfg.deployment.env_vars will be deprecated in future versions. "
+            "Use cfg.execution.env_vars.deployment instead.",
+            category=DeprecationWarning,
+            stacklevel=2,
+        )
+        deployment_env_var_names.extend(list(cfg.deployment["env_vars"]))
+
+    # Always add MASTER_IP to the environment variables
+    if "MASTER_IP" not in deployment_env_var_names:
+        deployment_env_var_names.append("MASTER_IP")
+
+    if deployment_env_var_names:
+        s += f"--container-env {','.join(deployment_env_var_names)} "
+    s += "{} &\n\n".format(cfg.deployment.command)  # run asynchronously
+    s += "SERVER_PID=$! # capture the PID of the server background srun process\n\n"
+
+    return s
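The environment handling in `_generate_deployment_srun_command` boils down to merging two name lists and guaranteeing `MASTER_IP` is forwarded into the container. A small sketch of just that step, with plain dictionaries standing in for the config objects; `deployment_env_names` and `container_env_flag` are hypothetical names used only for illustration:

```python
from typing import List, Mapping


def deployment_env_names(
    execution_env: Mapping[str, str], deployment_env: Mapping[str, str]
) -> List[str]:
    """Merge env var names from both config locations and ensure MASTER_IP is present."""
    names = list(execution_env)
    names.extend(deployment_env)  # legacy cfg.deployment.env_vars location
    if "MASTER_IP" not in names:
        names.append("MASTER_IP")
    return names


def container_env_flag(names: List[str]) -> str:
    """Format the names as the --container-env fragment appended to the srun command."""
    return f"--container-env {','.join(names)} " if names else ""


names = deployment_env_names({"HF_TOKEN": "x"}, {"NGC_API_KEY": "y"})
print(container_env_flag(names))
```

Because `MASTER_IP` is always appended, the flag is never empty in practice, so the `if deployment_env_var_names:` guard in the PR code is effectively always true.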


+def _get_wait_for_server_handler(
+    ip_list: str,
+    port: int,
+    health_check_path: str,
+    service_name: str = "server",
+    check_pid: bool = False,
+):
+    """Generate wait for server handler that takes a list of IPs."""
+    pid_check = ""
+    if check_pid:
+        pid_check = 'kill -0 "$SERVER_PID" 2>/dev/null || { echo "Server process $SERVER_PID died"; exit 1; }'
+
+    handler = f"""date
+# wait for the {service_name} to initialize
+for ip in {ip_list}; do
+  echo "Waiting for {service_name} on $ip..."
+  while [[ "$(curl -s -o /dev/null -w "%{{http_code}}" http://$ip:{port}{health_check_path})" != "200" ]]; do
+    {pid_check}
+    sleep 5
+  done
+  echo "{service_name} ready on $ip!"
+done
 date
 """.strip()
+
+    return handler
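The generated shell loop polls each IP in turn until its health endpoint answers 200, optionally aborting if the server process has died. The same control flow can be sketched in Python; this is an illustration only, since the PR emits bash, and the names `http_status`, `wait_for_servers`, `probe`, and `is_alive` are hypothetical:

```python
import time
import urllib.error
import urllib.request
from typing import Callable, Iterable, List


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status for url, or 0 if the connection fails (curl prints 000)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code
    except OSError:  # includes URLError, refused connections, and timeouts
        return 0


def wait_for_servers(
    ips: Iterable[str],
    port: int,
    health_check_path: str,
    probe: Callable[[str], int] = http_status,
    is_alive: Callable[[], bool] = lambda: True,
    interval: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Block until every ip answers 200 on the health path; return the URLs checked."""
    checked = []
    for ip in ips:
        url = f"http://{ip}:{port}{health_check_path}"
        while probe(url) != 200:
            # Mirrors the kill -0 "$SERVER_PID" check in the generated script.
            if not is_alive():
                raise RuntimeError("Server process died")
            sleep(interval)
        checked.append(url)
    return checked
```

Like the shell handler, this sketch compares against a fixed 200 rather than the configurable `health_check_status`, and it checks nodes sequentially, so total wait time is bounded by the slowest node rather than the sum only if nodes start in parallel.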


+def _get_proxy_server_srun_command(cfg, remote_task_subdir):
+    """Generate HAProxy proxy server srun command using template-based config."""
+    s = ""
+    s += "# HAProxy load balancer\n"
+    s += "# Copy template to config file (important for restarts)\n"
+    s += f"cp {remote_task_subdir}/haproxy.cfg.template {remote_task_subdir}/haproxy.cfg\n"
+    s += "# Replace placeholder IPs with actual node IPs\n"
+    s += f"haproxy_config_file={remote_task_subdir}/haproxy.cfg\n"
+    s += 'for i in "${!NODES_IPS_ARRAY[@]}"; do\n'
+    s += '    ip="${NODES_IPS_ARRAY[$i]}"\n'
+    s += '    sed -i "s/{IP_$i}/$ip/g" "$haproxy_config_file"\n'
+    s += "done\n"
+    s += "\n"
+    s += "srun --mpi pmix --overlap "
+    s += "--nodes 1 --ntasks 1 "
+    s += f"--container-image {cfg.execution.get('proxy', {}).get('image', 'haproxy:latest')} "
+    s += f"--container-mounts {remote_task_subdir}/haproxy.cfg:/usr/local/etc/haproxy/haproxy.cfg:ro "
+    s += f"--output {remote_task_subdir}/logs/haproxy-%A.out "
+    s += "haproxy -f /usr/local/etc/haproxy/haproxy.cfg &\n"
+    s += "HAPROXY_PID=$! # capture the PID of the HAProxy background srun process\n"
+    s += 'echo "HAProxy started with PID: $HAPROXY_PID"\n\n'
+
+    # Wait for HAProxy to be ready on localhost
+    proxy_config = cfg.execution.get("proxy", {}).get("config", {})
+    haproxy_port = proxy_config.get("haproxy_port", 5009)
+    health_path = proxy_config.get("health_check_path", "/health")
+    s += _get_wait_for_server_handler(
+        "127.0.0.1", haproxy_port, health_path, "HAProxy", check_pid=False
+    )
+    s += "\n"
+
+    return s
@@ -0,0 +1,26 @@
+global
+    log stdout format raw local0
+    maxconn 4096
+
+defaults
+    log global
+    mode http
+    option httplog
+    timeout connect 10s
+    timeout client 100000s
+    timeout server 100000s
+
+frontend service_frontend
+    bind *:{{ haproxy_port }}
+    default_backend service_backend
+
+backend service_backend
+    mode http
+    option httpchk GET {{ health_check_path }}
+    http-check expect status {{ health_check_status }}
+    option http-server-close
+    balance leastconn
+{% for node in nodes %}
+    server node{{ loop.index }} {{ node.ip }}:{{ node.port }} check
+{% endfor %}
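To see what the `backend` section of this template expands to, here is a plain-Python approximation of the Jinja loop. It is a sketch for illustration, not the project's rendering path (which goes through Jinja2), and its whitespace will not be byte-identical to Jinja's output; `render_backend` is a hypothetical name.

```python
from typing import Dict, List


def render_backend(
    nodes: List[Dict[str, object]],
    health_check_path: str = "/health",
    health_check_status: int = 200,
) -> str:
    """Approximate the rendered backend section: one server line per node."""
    lines = [
        "backend service_backend",
        "    mode http",
        f"    option httpchk GET {health_check_path}",
        f"    http-check expect status {health_check_status}",
        "    option http-server-close",
        "    balance leastconn",
    ]
    # Jinja's loop.index is 1-based, so servers are named node1, node2, ...
    for index, node in enumerate(nodes, start=1):
        lines.append(f"    server node{index} {node['ip']}:{node['port']} check")
    return "\n".join(lines)


backend = render_backend(
    [{"ip": "10.0.0.1", "port": 8000}, {"ip": "10.0.0.2", "port": 8000}]
)
print(backend)
```

With `balance leastconn`, HAProxy sends each new request to the backend with the fewest active connections, which suits long-running inference requests better than round-robin; the `check` keyword enables the HTTP health check declared above each server line.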
