Add Gemma-style attention softcap support (clean PR after branch mix-up)#76
Aarya-Sutar wants to merge 3 commits into skylight-org:main
Conversation
@sahiljoshi515 Models tested:
Both models were evaluated with:
All runs completed successfully and produced valid outputs. The benchmark pipeline generated the expected result artifacts (raw_results.csv, metrics.json, config.json) for each configuration. I also attempted to include meta-llama/Llama-3.2-1B-Instruct as an additional baseline, but the run failed due to HuggingFace access restrictions for that gated model. Results are available under:
Can we try to reproduce the results for RULER32K that are listed in this paper: https://arxiv.org/pdf/2503.19786v1? Also, please run the older models like Llama-3.2-1B and Llama-3.2-3B to reproduce the results listed here: https://sky-light.eecs.berkeley.edu/#/models/c21be1c9-2836-4ced-90b5-111b60d05f1d
I ran a validation of the RULER32K benchmark locally. Models tested:
Configs:
To fit within my local GPU (RTX 3050, 6GB), I limited the runtime context to 4096 tokens while keeping the RULER32K dataset pipeline. Results:
meta-llama/Llama-3.2-1B-Instruct
This confirms that the Gemma softcap implementation does not break inference or the RULER evaluation pipeline.
@claude review
I'll analyze this and get back to you.
Any new model PR should include a comparison of the following configurations on RULER-HARD.
Without these comparisons, it is hard to know whether the change actually supports the new model family inside the repo.
I ran the requested comparison matrix on RULER-HARD (ruler32k subsets) for google/gemma-2b-it with:
All runs completed successfully across all 13 RULER32K subsets. Results (mean across subsets):
Per-subset breakdown shows identical behavior across all configurations. Important note: the runtime context was again limited to fit my GPU, so most long-context tasks (RULER32K) are truncated and yield low scores.
I can rerun full 32k experiments on a higher-memory GPU if needed for a meaningful performance comparison.
@Pd172944 / @sahiljoshi515 can you take this patch and verify whether the gemma models work? Without a full comparison, we have no evidence that it works! Also, make sure to run some other models to ensure that nothing breaks after this change.
Results for gemma-3-27B-it config=None fwe: 99.33
Results for gemma-3-27B-it config=empty
@Aarya-Sutar something is definitely off
```python
# Handle dense-only mode when sparse_attention_config is None
self._sparse_attention_available: bool = sparse_attention_config is not None
# Handle dense-only mode when sparse attention is absent or has no active maskers.
```
This is not the right change; we want the empty masker to use our own backend, i.e. the soft_cap functionality.
Yeah, you're right; I misunderstood the intended behavior.
I'll push a fix shortly.
I tracked down the issue with None vs empty masker configs. The problem was that softcap from ResearchAttentionConfig wasn’t being passed through the custom attention pipeline, so even with masker_configs=[], it behaved like dense attention. What I did:
Verification:
Even if outputs match on simple prompts, the execution paths are now correct:
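To make the distinction concrete, here is a minimal sketch of the routing being fixed (the class and field names are illustrative assumptions, not the repo's actual API): a `None` config should fall back to dense attention, while an empty masker list should still route through the custom backend so that config-level features like softcap take effect.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ResearchAttentionConfig:
    # Hypothetical stand-in for the repo's config object.
    masker_configs: list = field(default_factory=list)
    softcap: Optional[float] = None

def select_backend(config: Optional[ResearchAttentionConfig]) -> str:
    if config is None:
        # No research config at all: plain dense attention.
        return "dense"
    # Even with no active maskers, the custom backend must run so that
    # features carried on the config (e.g. softcap) are still applied.
    return "custom"

assert select_backend(None) == "dense"
assert select_backend(ResearchAttentionConfig(masker_configs=[], softcap=50.0)) == "custom"
```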
@Aarya-Sutar I think it is better now, but the average density is not showing right on this branch, and I am not sure why. Can you make sure that when we pass in 20% for topK, the average density is 0.2? It shows 0.4 right now. Also, I am not sure why there are still so many changes; ideally changes would only be in 2-3 files.
The density mismatch comes from the sink + local maskers being composed with TopK. I tested this explicitly:
So the earlier ~0.4 density is expected under the current composition, not an issue with TopK itself. I've prepared a minimal change to address it. Before I push that update, I wanted to confirm the intended composition behavior. Also, about the additional files being changed: I forgot to remove the debug prints and comments from huggingface.py; I'll restore it.
@Aarya-Sutar I agree that sink + local make a difference, but they shouldn't increase density by that much. If you test other models, you will see that it's around 0.2 even with sink + local. Check this to see when to apply sparsity: https://github.com/huggingface/transformers/blob/main/src/transformers/models/gemma3/modeling_gemma3.py#L278 and https://github.com/huggingface/transformers/blob/main/src/transformers/models/gemma3/modeling_gemma3.py#L340. @Aarya-Sutar We only want sparsity when the model uses dense attention, not sliding window; maybe that is why you see weird values of sparsity.
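The gating suggested above can be sketched as follows. This assumes the Gemma 3 convention from recent `transformers` versions, where the config exposes per-layer attention types as strings; the function name here is illustrative, not the repo's actual API.

```python
from typing import List

def should_apply_sparsity(layer_types: List[str], layer_idx: int) -> bool:
    """Only sparsify layers that do full (dense) attention.

    Sliding-window layers are already local, so masking them again both
    wastes work and skews the reported average density.
    """
    return layer_types[layer_idx] == "full_attention"

# Gemma 3 interleaves sliding-window and full-attention layers.
layer_types = ["sliding_attention", "full_attention"] * 3
assert should_apply_sparsity(layer_types, 1)
assert not should_apply_sparsity(layer_types, 0)
```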

This PR adds support for the attention softcap used in Google Gemma models.
The attention logits are transformed using:
scores = softcap * tanh(scores / softcap)
This transformation is applied after the QKᵀ scaling and before masking/exponentiation in the attention computation.
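As a sketch of the transform described above (tensor shapes and the cap value are illustrative; `apply_softcap` follows the helper named in the changes below, but this is not the repo's exact code):

```python
from typing import Optional
import torch

def apply_softcap(scores: torch.Tensor, softcap: Optional[float]) -> torch.Tensor:
    """Gemma-style logit softcapping: softcap * tanh(scores / softcap).

    Smoothly bounds attention logits to (-softcap, softcap);
    softcap=None leaves the scores unchanged.
    """
    if softcap is None:
        return scores
    return softcap * torch.tanh(scores / softcap)

# Applied after QK^T scaling, before masking/softmax:
q = torch.randn(1, 8, 16, 64)            # (batch, heads, seq, head_dim)
k = torch.randn(1, 8, 16, 64)
scores = q @ k.transpose(-1, -2) / 64 ** 0.5
capped = apply_softcap(scores, softcap=50.0)
assert capped.abs().max() < 50.0         # logits are strictly bounded by the cap
```

With `softcap=None` the helper is the identity, which is what keeps existing (non-Gemma) models unaffected.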
Changes:
- Added an `apply_softcap` helper
- Applied it in `_compute_masked_exp_attention_weights`
- `softcap=None` (the default) preserves existing behavior

Note:
This is a clean PR replacing the previous one. The earlier PR accidentally included unrelated changes because the branch was created from
`fix-benchmark-error-handling` instead of `main`. This version contains only the softcap-related modification.