
[Test]: Refactor benchmark_geglu with standardized model configs#1116

Open
noemotiovon wants to merge 2 commits into linkedin:main from noemotiovon:benchmark

Conversation

@noemotiovon noemotiovon commented Mar 2, 2026

Introduce benchmark_model_configs.py with canonical LLM architecture profiles (LLAMA_2_7B, LLAMA_3_8B) and two utilities:

  • estimate_kernel_bytes_per_token: runtime probe that measures peak GPU memory per token for a given kernel, providing a safe upper bound.
  • compute_benchmark_shape: derives batch size and sequence length from available device memory, the model profile, and the measured bytes-per-token, replacing hardcoded shapes.
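
To illustrate the idea, here is a minimal, hypothetical sketch of how such a shape-derivation helper could work. The signature, the safety factor, and the power-of-two rounding are illustrative assumptions, not the PR's exact implementation:

```python
def compute_benchmark_shape(total_memory_bytes, bytes_per_token,
                            safety_factor=0.8, bsz=4):
    """Derive (bsz, seq_len) that fits in the device memory budget.

    Scales the byte budget by a safety factor so the benchmark stays
    below the device limit, divides by the probe-measured bytes per
    token, then rounds seq_len down to a power of two for stable
    kernel launch configurations. (Illustrative sketch only.)
    """
    budget = int(total_memory_bytes * safety_factor)
    max_tokens = budget // bytes_per_token
    seq_len = max(1, max_tokens // bsz)
    # Round down to the nearest power of two.
    seq_len = 1 << (seq_len.bit_length() - 1)
    return bsz, seq_len

# Example: an 80 GB device with 2 MiB measured per token.
bsz, seq_len = compute_benchmark_shape(80 * 1024**3, 2 * 1024**2)
```

The key point is that the shape is a pure function of device memory and the measured bytes-per-token, so the same script adapts across GPUs without hardcoded constants.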

Add run_speed_benchmark and run_memory_benchmark helpers to utils.py to eliminate repeated triton.testing / _test_memory blocks across scripts. Also add a --model CLI argument to parse_benchmark_script_args so callers can select a model profile at runtime.

Refactor benchmark_dyt.py, benchmark_geglu.py, and benchmark_swiglu.py to:

  • Extract a setup factory function to avoid duplicating layer/tensor construction between speed and memory benchmarks.
  • Delegate to run_speed_benchmark / run_memory_benchmark.
  • Compute x_values, bsz, and related fields dynamically via compute_benchmark_shape instead of hardcoded constants.
  • Expand memory benchmark modes from ["full"] to ["full", "forward", "backward"] for consistency with speed benchmarks.
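
The factory pattern described above can be sketched as follows. The names and the trivial stand-in "layer" are hypothetical (the real scripts construct CUDA tensors and Liger/HF layers); the point is that one zero-argument setup callable feeds both the speed and memory paths:

```python
def make_setup(hidden_size, seq_len):
    """Hypothetical setup factory: returns a zero-argument callable
    that builds fresh inputs on each call, so speed and memory
    benchmarks construct identical workloads without duplication."""
    def setup():
        # Stand-in for tensor/layer construction; the real scripts
        # would allocate device tensors and instantiate the layer here.
        x = [[0.0] * hidden_size for _ in range(seq_len)]
        layer = lambda inp: inp  # placeholder for the kernel under test
        return x, layer
    return setup

def run_speed_benchmark(setup, mode="full"):
    # Illustrative delegation: the shared helper owns the timing loop;
    # the benchmark script only supplies the setup factory.
    x, layer = setup()
    return layer(x)

setup = make_setup(hidden_size=8, seq_len=4)
out = run_speed_benchmark(setup)
```

With this split, adding a new kernel benchmark reduces to writing one setup factory and delegating to the shared helpers.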

Hardware Type: NVIDIA A100-SXM4-80GB

  • run make test to ensure correctness
  • run make checkstyle to ensure code style
  • run make test-convergence to ensure convergence

@noemotiovon (Contributor, Author)

Hi @Tcc0403, I’ve reviewed your refactoring proposal for the benchmark. I’ve tried implementing it based on the issue — could we discuss what further optimizations might be needed?
Related issue: #1051

@noemotiovon noemotiovon marked this pull request as ready for review March 12, 2026 10:23
@noemotiovon (Contributor, Author)

Hi @Tcc0403, I’ve completed the benchmark refactor for element-wise level kernels.

At the moment, it seems that the only device-dependent factor is global memory, so I’ve temporarily removed the previously agreed device-specific checks.

Could you help review the code?

@Tcc0403 (Collaborator) left a comment:


Should we also create a guideline for adding benchmark scripts?

@Tcc0403 (Collaborator)

missing a probe run for upperbound?

@noemotiovon (Contributor, Author)

At first I thought that since hidden_size is a model parameter, it might not need to be probed. However, I now think that for cases where we need to avoid VRAM eviction, we should consistently scale based on x_values. A new probe has already been added here.

At the same time, I also realized that not all kernels are related to seq_len. There may be other types of kernels in the future, and for different kinds of kernels we should define matching configurations. I’d also be happy to implement this part in the future.

Comment on lines +66 to +83
    probe_seq_len = 1024
    probe_input = SingleBenchmarkRunInput(
        x=probe_seq_len,
        kernel_provider="huggingface",
        extra_benchmark_config={
            "bsz": 1,
            "hidden_size": model.hidden_size,
            "intermediate_size": model.intermediate_size,
            "hidden_act": "gelu_pytorch_tanh",
            "dtype": model.dtype,
        },
    )
    probe_x, probe_layer = _setup_geglu(probe_input)
    kernel_bpt = estimate_kernel_bytes_per_token(
        kernel_fn=lambda: probe_layer(probe_x),
        num_tokens=probe_seq_len,
    )
    del probe_x, probe_layer
@Tcc0403 (Collaborator)

I believe input setup and cleanup should be done in estimate_kernel_bytes_per_token, so developers don't have to worry about them.
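
The suggestion above could be internalized roughly as follows. This sketch uses `tracemalloc` as a host-memory stand-in for peak-memory tracking (a real CUDA implementation would use something like `torch.cuda.max_memory_allocated` instead); `probe_setup` and the helper's signature are illustrative, not the PR's final API:

```python
import tracemalloc

def estimate_kernel_peak_memory(setup_fn):
    """Run one probe invocation and return the peak bytes observed.

    setup_fn builds the probe inputs and returns a zero-argument
    kernel callable. Construction and cleanup both happen inside
    this helper, so callers never manage probe tensors themselves.
    """
    tracemalloc.start()
    kernel = setup_fn()   # probe inputs are built under tracking
    kernel()              # single probe run
    peak = tracemalloc.get_traced_memory()[1]
    tracemalloc.stop()    # probe objects go out of scope on return
    return peak

def probe_setup():
    data = [0.0] * 100_000  # stand-in for the probe input tensor
    return lambda: sum(data)

peak_bytes = estimate_kernel_peak_memory(probe_setup)
```

Encapsulating setup and cleanup this way is what the later commit's rename to `estimate_kernel_peak_memory` implements.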

@noemotiovon (Contributor, Author)

Thanks for your suggestion!

@noemotiovon (Contributor, Author)

I think my previous thinking wasn’t quite right. I may need to reorganize my thoughts.

@noemotiovon noemotiovon marked this pull request as draft March 13, 2026 08:55
- Rename BenchmarkShapeConfig → SeqLenSweepConfig
- Rename compute_benchmark_shape → compute_seq_len_sweep_config
- Add HiddenSizeSweepConfig and compute_hidden_size_sweep_config for DyT
- Add get_benchmark_model_config() helper
- Rename estimate_kernel_bytes_per_token → estimate_kernel_peak_memory
  (return total peak bytes; callers divide by num_tokens if needed)
- Move probe setup/cleanup into estimate_kernel_peak_memory
- Obtain device memory internally; remove total_memory_gb parameter
- Use infer_device() for device detection
- Refactor benchmark_dyt to use model config, probe, and adaptive x_values
@noemotiovon noemotiovon marked this pull request as ready for review March 16, 2026 09:53
@noemotiovon (Contributor, Author)

Hi @Tcc0403, I've now completed the revisions based on the review comments above and made some further adjustments to the code. First, I'd like to confirm that our rule is right: shapes should be scaled based on the x_values obtained from the probe.

For the kernels I’m currently handling, each has a single x_value, but there are two situations:

  1. The x_value is not a model parameter, for example seq_len. In this case, we can use the probe to compute a reasonable seq_len.

  2. The x_value is a model parameter, for example hidden_size. In this case, we probably shouldn’t directly use the model parameter as the x-axis, because hidden_size is fixed within a given model. That means this kind of kernel might not really depend on the specific model. So instead, we could also infer some reasonable hidden_size values through probing. Is this the right approach?
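
The two cases map naturally onto the two sweep configs introduced in the second commit (`SeqLenSweepConfig` / `HiddenSizeSweepConfig`). The following is a hedged sketch of that split; the field names, bounds, and candidate values are illustrative assumptions, not the PR's exact code:

```python
from dataclasses import dataclass, field

@dataclass
class SeqLenSweepConfig:
    # Case 1: x_value is seq_len, so a probe-derived upper bound
    # caps the sweep for the given device.
    bsz: int
    x_values: list = field(default_factory=list)

@dataclass
class HiddenSizeSweepConfig:
    # Case 2: x_value is hidden_size; sweep several canonical sizes
    # rather than the single value fixed by any one model.
    x_values: list = field(default_factory=list)

def compute_seq_len_sweep_config(max_seq_len, bsz=4):
    # Powers of two from 1024 up to the probe-derived upper bound.
    xs, n = [], 1024
    while n <= max_seq_len:
        xs.append(n)
        n *= 2
    return SeqLenSweepConfig(bsz=bsz, x_values=xs)

def compute_hidden_size_sweep_config(max_hidden=16384):
    candidates = (1024, 2048, 4096, 8192, 16384)
    return HiddenSizeSweepConfig(
        x_values=[h for h in candidates if h <= max_hidden])

cfg = compute_seq_len_sweep_config(8192)
```

Under this framing, a kernel like GeGLU sweeps seq_len under a probe-derived cap, while a kernel like DyT sweeps hidden_size across canonical sizes independent of any single model.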
