[Test]: Refactor benchmark_geglu with standardized model configs #1116
noemotiovon wants to merge 2 commits into linkedin:main from
Conversation
Introduce `benchmark_model_configs.py` with canonical LLM architecture profiles (LLAMA_2_7B, LLAMA_3_8B) and two utilities:

- `estimate_kernel_bytes_per_token`: runtime probe that measures peak GPU memory per token for a given kernel, providing a safe upper bound.
- `compute_benchmark_shape`: derives batch size and sequence length from available device memory, the model profile, and the measured bytes-per-token, replacing hardcoded shapes.

Add `run_speed_benchmark` and `run_memory_benchmark` helpers to `utils.py` to eliminate repeated `triton.testing` / `_test_memory` blocks across scripts. Also add a `--model` CLI argument to `parse_benchmark_script_args` so callers can select a model profile at runtime.

Refactor `benchmark_dyt.py`, `benchmark_geglu.py`, and `benchmark_swiglu.py` to:

- Extract a `_setup_<kernel>` factory function to avoid duplicating layer/tensor construction between speed and memory benchmarks.
- Delegate to `run_speed_benchmark` / `run_memory_benchmark`.
- Compute `x_values`, `bsz`, and related fields dynamically via `compute_benchmark_shape` instead of hardcoded constants.
- Expand memory benchmark modes from `["full"]` to `["full", "forward", "backward"]` for consistency with speed benchmarks.
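The shape derivation described above can be sketched as follows. This is a minimal, hypothetical illustration of the idea: the function name matches the PR, but the body, the power-of-two rounding, and the 80% safety margin are assumptions, not the PR's actual code.

```python
def compute_benchmark_shape(total_memory_bytes, kernel_bytes_per_token,
                            max_seq_len=16384, safety_fraction=0.8):
    """Sketch: derive (bsz, x_values) from a memory budget and a measured
    bytes-per-token upper bound, instead of hardcoding shapes.

    All constants here (16384 cap, 0.8 margin) are illustrative.
    """
    budget = int(total_memory_bytes * safety_fraction)
    # Largest power-of-two seq_len whose single-batch footprint fits.
    seq_len = 1024
    while seq_len * 2 <= max_seq_len and (seq_len * 2) * kernel_bytes_per_token <= budget:
        seq_len *= 2
    # Sweep powers of two from 1024 up to that seq_len.
    x_values = [2**i for i in range(10, seq_len.bit_length())]
    # Batch size: how many copies of the largest shape still fit,
    # rounded down to a power of two.
    bsz = max(1, budget // (seq_len * kernel_bytes_per_token))
    bsz = 1 << (bsz.bit_length() - 1)
    return bsz, x_values
```

With an 80 GiB device and a probed 1 MiB/token bound, this yields a sweep up to the 16384-token cap and a batch size that keeps the largest point inside the budget.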
Hi @Tcc0403, I’ve completed the benchmark refactor for the element-wise kernels. At the moment, the only device-dependent factor seems to be global memory, so I’ve temporarily removed the previously agreed device-specific checks. Could you help review the code?
Tcc0403 left a comment
Should we also create a guideline for adding benchmark scripts?
benchmark/scripts/benchmark_dyt.py
Outdated
Missing a probe run for the upper bound?
At first I thought that since hidden_size is a model parameter, it might not need to be probed. However, I now think that for cases where we need to avoid VRAM eviction, we should consistently scale based on x_values. A new probe has already been added here.
At the same time, I also realized that not all kernels are related to seq_len. There may be other types of kernels in the future, and for different kinds of kernels we should define matching configurations. I’d also be happy to implement this part in the future.
benchmark/scripts/benchmark_geglu.py
Outdated
```python
probe_seq_len = 1024
probe_input = SingleBenchmarkRunInput(
    x=probe_seq_len,
    kernel_provider="huggingface",
    extra_benchmark_config={
        "bsz": 1,
        "hidden_size": model.hidden_size,
        "intermediate_size": model.intermediate_size,
        "hidden_act": "gelu_pytorch_tanh",
        "dtype": model.dtype,
    },
)
probe_x, probe_layer = _setup_geglu(probe_input)
kernel_bpt = estimate_kernel_bytes_per_token(
    kernel_fn=lambda: probe_layer(probe_x),
    num_tokens=probe_seq_len,
)
del probe_x, probe_layer
```
I believe input setup and cleanup should be done in estimate_kernel_bytes_per_token, so developers don't have to worry about them.
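One way to realize this suggestion is to have the estimator own the whole probe lifecycle: it receives a setup callable, runs the kernel, and drops the probe tensors before returning. The sketch below is device-agnostic for illustration; `setup_fn`, `measure_peak_fn`, and all other names are hypothetical, and a real implementation would measure peak bytes via something like `torch.cuda.max_memory_allocated` internally rather than taking an injected measurer.

```python
def estimate_kernel_peak_memory(setup_fn, measure_peak_fn, num_warmup=1):
    """Sketch: probe peak memory for a kernel, owning setup and cleanup.

    setup_fn: () -> (inputs, run_fn); builds probe tensors and returns a
        zero-argument callable that runs the kernel on them.
    measure_peak_fn: (callable) -> int; runs the callable and reports peak
        bytes (on CUDA, a wrapper around torch.cuda.max_memory_allocated).
    """
    inputs, run_fn = setup_fn()
    try:
        for _ in range(num_warmup):
            run_fn()  # warm up caches / autotuning before measuring
        return measure_peak_fn(run_fn)
    finally:
        # Drop references so the probe tensors can be freed before the
        # real benchmark allocates its full-size inputs.
        del inputs, run_fn
```

With this shape, call sites reduce to a single call and no longer need the manual `del probe_x, probe_layer` cleanup shown in the diff above.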
Thanks for your suggestion!
I think my previous thinking wasn’t quite right. I may need to reorganize my thoughts.
- Rename `BenchmarkShapeConfig` → `SeqLenSweepConfig`
- Rename `compute_benchmark_shape` → `compute_seq_len_sweep_config`
- Add `HiddenSizeSweepConfig` and `compute_hidden_size_sweep_config` for DyT
- Add `get_benchmark_model_config()` helper
- Rename `estimate_kernel_bytes_per_token` → `estimate_kernel_peak_memory` (return total peak bytes; callers divide by `num_tokens` if needed)
- Move probe setup/cleanup into `estimate_kernel_peak_memory`
- Obtain device memory internally; remove `total_memory_gb` parameter
- Use `infer_device()` for device detection
- Refactor `benchmark_dyt` to use model config, probe, and adaptive `x_values`
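The split into two config kinds above reflects the earlier observation that not every kernel sweeps over sequence length. A hypothetical sketch of what the two config classes might look like (the class names come from the commit message; the fields are assumptions and may differ from the PR):

```python
from dataclasses import dataclass


@dataclass
class SeqLenSweepConfig:
    """Sweep over sequence length (GEGLU/SwiGLU-style kernels)."""
    bsz: int
    x_values: list  # sequence lengths, typically powers of two
    hidden_size: int
    intermediate_size: int


@dataclass
class HiddenSizeSweepConfig:
    """Sweep over hidden size (DyT-style kernels with no seq_len axis)."""
    bsz: int
    seq_len: int
    x_values: list  # hidden sizes to benchmark
```

Keeping the sweep axis explicit in the type lets future kernel families add their own config kind instead of overloading one shape struct.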
Hi @Tcc0403, I have now completed the revisions based on the review comments above and made some adjustments to the code. First, I want to confirm that our rule is correct: we should scale based on the x_values obtained from the probe. For the kernels I’m currently handling, each has a single x_value, but there are two situations:
Hardware Type: NVIDIA A100-SXM4-80GB
- run `make test` to ensure correctness
- run `make checkstyle` to ensure code style
- run `make test-convergence` to ensure convergence