
[Test]: Refactor benchmark_geglu with standardized model configs#1116

Open
noemotiovon wants to merge 2 commits into linkedin:main from noemotiovon:benchmark

Conversation

@noemotiovon noemotiovon commented Mar 2, 2026

Introduce benchmark_model_configs.py with canonical LLM architecture profiles (LLAMA_2_7B, LLAMA_3_8B) and two utilities:

  • estimate_kernel_bytes_per_token: runtime probe that measures peak GPU memory per token for a given kernel, providing a safe upper bound.
  • compute_benchmark_shape: derives batch size and sequence length from available device memory, the model profile, and the measured bytes-per-token, replacing hardcoded shapes.
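
To illustrate the idea, here is a minimal, hypothetical sketch of how such a shape-derivation helper could work. The signature, the safety factor, and the power-of-two rounding are illustrative assumptions, not the PR's exact implementation:

```python
def compute_benchmark_shape(total_memory_bytes, bytes_per_token,
                            safety_factor=0.8, bsz=4):
    """Derive (bsz, seq_len) that fits in the device memory budget.

    Scales the byte budget by a safety factor so the benchmark stays
    below the device limit, divides by the probe-measured bytes per
    token, then rounds seq_len down to a power of two for stable
    kernel launch configurations. (Illustrative sketch only.)
    """
    budget = int(total_memory_bytes * safety_factor)
    max_tokens = budget // bytes_per_token
    seq_len = max(1, max_tokens // bsz)
    # Round down to the nearest power of two.
    seq_len = 1 << (seq_len.bit_length() - 1)
    return bsz, seq_len

# Example: an 80 GB device with 2 MiB measured per token.
bsz, seq_len = compute_benchmark_shape(80 * 1024**3, 2 * 1024**2)
```

The key point is that the shape is a pure function of device memory and the measured bytes-per-token, so the same script adapts across GPUs without hardcoded constants.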

Add run_speed_benchmark and run_memory_benchmark helpers to utils.py to eliminate repeated triton.testing / _test_memory blocks across scripts. Also add a --model CLI argument to parse_benchmark_script_args so callers can select a model profile at runtime.

Refactor benchmark_dyt.py, benchmark_geglu.py, and benchmark_swiglu.py to:

  • Extract a setup factory function to avoid duplicating layer/tensor construction between speed and memory benchmarks.
  • Delegate to run_speed_benchmark / run_memory_benchmark.
  • Compute x_values, bsz, and related fields dynamically via compute_benchmark_shape instead of hardcoded constants.
  • Expand memory benchmark modes from ["full"] to ["full", "forward", "backward"] for consistency with speed benchmarks.
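
The factory pattern described above can be sketched as follows. The names and the trivial stand-in "layer" are hypothetical (the real scripts construct CUDA tensors and Liger/HF layers); the point is that one zero-argument setup callable feeds both the speed and memory paths:

```python
def make_setup(hidden_size, seq_len):
    """Hypothetical setup factory: returns a zero-argument callable
    that builds fresh inputs on each call, so speed and memory
    benchmarks construct identical workloads without duplication."""
    def setup():
        # Stand-in for tensor/layer construction; the real scripts
        # would allocate device tensors and instantiate the layer here.
        x = [[0.0] * hidden_size for _ in range(seq_len)]
        layer = lambda inp: inp  # placeholder for the kernel under test
        return x, layer
    return setup

def run_speed_benchmark(setup, mode="full"):
    # Illustrative delegation: the shared helper owns the timing loop;
    # the benchmark script only supplies the setup factory.
    x, layer = setup()
    return layer(x)

setup = make_setup(hidden_size=8, seq_len=4)
out = run_speed_benchmark(setup)
```

With this split, adding a new kernel benchmark reduces to writing one setup factory and delegating to the shared helpers.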

Hardware Type: NVIDIA A100-SXM4-80GB

  • run make test to ensure correctness
  • run make checkstyle to ensure code style
  • run make test-convergence to ensure convergence

@noemotiovon (Contributor, Author)

Hi @Tcc0403, I’ve reviewed your refactoring proposal for the benchmark. I’ve tried implementing it based on the issue — could we discuss what further optimizations might be needed?
Related issue: #1051

@noemotiovon noemotiovon marked this pull request as ready for review March 12, 2026 10:23
@noemotiovon (Contributor, Author)

Hi @Tcc0403, I’ve completed the benchmark refactor for element-wise level kernels.

At the moment, it seems that the only device-dependent factor is global memory, so I’ve temporarily removed the previously agreed device-specific checks.

Could you help review the code?

@Tcc0403 (Collaborator) left a comment:


Should we also create a guideline for adding benchmark scripts?

@Tcc0403 (Collaborator)

missing a probe run for upperbound?

@noemotiovon (Contributor, Author)

At first I thought that since hidden_size is a model parameter, it might not need to be probed. However, I now think that for cases where we need to avoid VRAM eviction, we should consistently scale based on x_values. A new probe has already been added here.

At the same time, I also realized that not all kernels are related to seq_len. There may be other types of kernels in the future, and for different kinds of kernels we should define matching configurations. I’d also be happy to implement this part in the future.

Comment on lines +66 to +83
    probe_seq_len = 1024
    probe_input = SingleBenchmarkRunInput(
        x=probe_seq_len,
        kernel_provider="huggingface",
        extra_benchmark_config={
            "bsz": 1,
            "hidden_size": model.hidden_size,
            "intermediate_size": model.intermediate_size,
            "hidden_act": "gelu_pytorch_tanh",
            "dtype": model.dtype,
        },
    )
    probe_x, probe_layer = _setup_geglu(probe_input)
    kernel_bpt = estimate_kernel_bytes_per_token(
        kernel_fn=lambda: probe_layer(probe_x),
        num_tokens=probe_seq_len,
    )
    del probe_x, probe_layer
@Tcc0403 (Collaborator)

I believe input setup and cleanup should be done in estimate_kernel_bytes_per_token, so developers don't have to worry about them.
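
The suggestion above could be internalized roughly as follows. This sketch uses `tracemalloc` as a host-memory stand-in for peak-memory tracking (a real CUDA implementation would use something like `torch.cuda.max_memory_allocated` instead); `probe_setup` and the helper's signature are illustrative, not the PR's final API:

```python
import tracemalloc

def estimate_kernel_peak_memory(setup_fn):
    """Run one probe invocation and return the peak bytes observed.

    setup_fn builds the probe inputs and returns a zero-argument
    kernel callable. Construction and cleanup both happen inside
    this helper, so callers never manage probe tensors themselves.
    """
    tracemalloc.start()
    kernel = setup_fn()   # probe inputs are built under tracking
    kernel()              # single probe run
    peak = tracemalloc.get_traced_memory()[1]
    tracemalloc.stop()    # probe objects go out of scope on return
    return peak

def probe_setup():
    data = [0.0] * 100_000  # stand-in for the probe input tensor
    return lambda: sum(data)

peak_bytes = estimate_kernel_peak_memory(probe_setup)
```

Encapsulating setup and cleanup this way is what the later commit's rename to `estimate_kernel_peak_memory` implements.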

@noemotiovon (Contributor, Author)

Thanks for your suggestion!

@noemotiovon (Contributor, Author)

I think my previous thinking wasn’t quite right. I may need to reorganize my thoughts.

@noemotiovon noemotiovon marked this pull request as draft March 13, 2026 08:55
- Rename BenchmarkShapeConfig → SeqLenSweepConfig
- Rename compute_benchmark_shape → compute_seq_len_sweep_config
- Add HiddenSizeSweepConfig and compute_hidden_size_sweep_config for DyT
- Add get_benchmark_model_config() helper
- Rename estimate_kernel_bytes_per_token → estimate_kernel_peak_memory
  (return total peak bytes; callers divide by num_tokens if needed)
- Move probe setup/cleanup into estimate_kernel_peak_memory
- Obtain device memory internally; remove total_memory_gb parameter
- Use infer_device() for device detection
- Refactor benchmark_dyt to use model config, probe, and adaptive x_values
@noemotiovon noemotiovon marked this pull request as ready for review March 16, 2026 09:53
@noemotiovon (Contributor, Author)

Hi @Tcc0403, I've now completed the revisions based on the review comments above and made some further adjustments to the code. First, I'd like to confirm that our rule is right: shapes should be scaled based on the x_values obtained from the probe.

For the kernels I’m currently handling, each has a single x_value, but there are two situations:

  1. The x_value is not a model parameter, for example seq_len. In this case, we can use the probe to compute a reasonable seq_len.

  2. The x_value is a model parameter, for example hidden_size. In this case, we probably shouldn’t directly use the model parameter as the x-axis, because hidden_size is fixed within a given model. That means this kind of kernel might not really depend on the specific model. So instead, we could also infer some reasonable hidden_size values through probing. Is this the right approach?
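
The two cases map naturally onto the two sweep configs introduced in the second commit (`SeqLenSweepConfig` / `HiddenSizeSweepConfig`). The following is a hedged sketch of that split; the field names, bounds, and candidate values are illustrative assumptions, not the PR's exact code:

```python
from dataclasses import dataclass, field

@dataclass
class SeqLenSweepConfig:
    # Case 1: x_value is seq_len, so a probe-derived upper bound
    # caps the sweep for the given device.
    bsz: int
    x_values: list = field(default_factory=list)

@dataclass
class HiddenSizeSweepConfig:
    # Case 2: x_value is hidden_size; sweep several canonical sizes
    # rather than the single value fixed by any one model.
    x_values: list = field(default_factory=list)

def compute_seq_len_sweep_config(max_seq_len, bsz=4):
    # Powers of two from 1024 up to the probe-derived upper bound.
    xs, n = [], 1024
    while n <= max_seq_len:
        xs.append(n)
        n *= 2
    return SeqLenSweepConfig(bsz=bsz, x_values=xs)

def compute_hidden_size_sweep_config(max_hidden=16384):
    candidates = (1024, 2048, 4096, 8192, 16384)
    return HiddenSizeSweepConfig(
        x_values=[h for h in candidates if h <= max_hidden])

cfg = compute_seq_len_sweep_config(8192)
```

Under this framing, a kernel like GeGLU sweeps seq_len under a probe-derived cap, while a kernel like DyT sweeps hidden_size across canonical sizes independent of any single model.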
