Preserve run details and allow custom stack metadata in "Run-Only" mode #876
Description
Component
Analysis/Plotting
Desired use case or feature
When executing `llmdbenchmark run` against an existing inference endpoint (e.g., using `--endpoint-url` without running a prior standup), the resulting `benchmark_report.json` should still contain complete and accurate metadata. Currently, much of this metadata drops out or defaults to empty due to a mix of subshell scoping bugs and hardcoded assumptions about the standup phase.
There are two distinct areas where metadata drops out during run-only execution:
- Lost harness execution details: the harness execution wrapper (`workload/harnesses/inference-perf-llm-d-benchmark.sh`) attempts to record run metrics by exporting environment variables right before it exits:

  ```shell
  export LLMDBENCH_HARNESS_START=$(date -d "@${start}" --iso-8601=seconds)
  export LLMDBENCH_HARNESS_ARGS="--config_file ..."
  export LLMDBENCH_HARNESS_VERSION=...
  ```

  However, the parent orchestrator (`build/llm-d-benchmark.sh`) executes this wrapper as a child process (`/usr/local/bin/${LLMDBENCH_RUN_EXPERIMENT_HARNESS}`). Because the wrapper is executed rather than sourced, the exported variables die with the subshell. By the time the analyzer executes (`/usr/local/bin/${LLMDBENCH_RUN_EXPERIMENT_ANALYZER}`), those variables are lost. The conversion script (`native_to_br0_2.py`) attempts `os.environ.get("LLMDBENCH_HARNESS_START")`, finds nothing, and leaves `scenario.load.native.args` and the custom timing metrics empty or null.
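The scoping failure can be reproduced in a few lines of standalone shell (the file name and variable value below are hypothetical, not part of the repo):

```shell
# Minimal repro of the scoping bug: a variable exported inside an
# *executed* child script is scoped to that child process and is gone
# once it exits; *sourcing* the same script keeps it in the caller.
cat > /tmp/harness_demo.sh <<'EOF'
export LLMDBENCH_HARNESS_START="2024-01-01T00:00:00+00:00"
EOF
chmod +x /tmp/harness_demo.sh

unset LLMDBENCH_HARNESS_START
/tmp/harness_demo.sh                 # executed as a child: export dies with it
AFTER_EXEC="${LLMDBENCH_HARNESS_START:-<unset>}"

. /tmp/harness_demo.sh               # sourced into this shell: export survives
AFTER_SOURCE="${LLMDBENCH_HARNESS_START:-<unset>}"

echo "executed: ${AFTER_EXEC}"
echo "sourced:  ${AFTER_SOURCE}"
```

This is exactly the relationship between the orchestrator and the harness wrapper: the wrapper is executed, not sourced, so its exports never reach the analyzer.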
- Missing stack specification ConfigMap: because `standup` wasn't used, the `llm-d-benchmark-standup-parameters` ConfigMap is never created in the cluster, and `/standup/ev.yaml` isn't mounted into the launcher pod. When `native_to_br0_2.py` queries the active Kubernetes namespace to populate the stack specifications (e.g., TP, DP, accelerator type, model name), it finds the volume/ConfigMap empty, yielding a completely blank stack representation in the final report.
Proposed solution
- Fix the subshell variable loss: instead of relying on passing environment variables between sequential child scripts, have the harness script write the metadata (start time, delta, CLI args, tool version) to a predictable intermediate file within `$LLMDBENCH_RUN_EXPERIMENT_RESULTS_DIR` (e.g., `run_metadata.yaml`). Modify `native_to_br0_2.py` to read this artifact rather than relying solely on `os.environ`.
- Support custom stack definitions in run-only mode: permit users running with `--endpoint-url` to provide an optional `--topology` YAML file or flags (e.g., `--tp=4`, `--accelerator=h100`) that pass the stack dimensions to the launcher directly. The analyzer would fall back to injecting these CLI values into the resulting `BenchmarkReport`, so users analyzing external endpoints still get fully populated `scenario.stack` blocks.
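A minimal sketch of the first fix, assuming the wrapper keeps its current variables and that `run_metadata.yaml` is a flat key/value file (the key names and fallback paths are hypothetical):

```shell
# Hypothetical end-of-run snippet for the harness wrapper: persist the
# metadata as a file in the shared results directory so it survives the
# wrapper's exit, instead of relying on exported environment variables.
RESULTS_DIR="${LLMDBENCH_RUN_EXPERIMENT_RESULTS_DIR:-/tmp/llmdbench-results}"
start="${start:-$(date +%s)}"        # epoch seconds captured at harness start
mkdir -p "${RESULTS_DIR}"
cat > "${RESULTS_DIR}/run_metadata.yaml" <<EOF
harness_start: "$(date -d "@${start}" --iso-8601=seconds)"
harness_args: "${LLMDBENCH_HARNESS_ARGS:-}"
harness_version: "${LLMDBENCH_HARNESS_VERSION:-}"
EOF
```

`native_to_br0_2.py` would then open this file first and fall back to `os.environ` only for backward compatibility. Note that `date -d` is GNU-specific, matching the existing wrapper.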
Alternatives
No response
Additional context or screenshots
No response