[Perf] Conditionally enable SWAP AB for speculative decoding

In speculative decoding, the draft model commonly handles only a small number of tokens. When `q_seq_len * head_group_size` is small, enabling SWAP AB provides a considerable performance improvement, so SWAP AB is enabled only in that case.
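The condition can be sketched as below. This is an illustration under stated assumptions, not the code in this PR: `should_swap_ab`, `tile_utilization`, and the `max_rows` cutoff are hypothetical names and values, and the tile sizes assume Hopper-style warpgroup MMA shapes, where the M extent is fixed at 64 and the N extent can be a multiple of 8. Under that assumption, swapping A and B moves the small `q_seq_len * head_group_size` dimension from M to N, so less of each tile is padding.

```python
import math

# Assumed tile granularities for a Hopper-style warpgroup MMA (not taken from this PR).
MMA_M_TILE = 64  # M extent of one MMA tile
MMA_N_STEP = 8   # N extent can grow in steps of 8


def should_swap_ab(q_seq_len: int, head_group_size: int, max_rows: int = 32) -> bool:
    """Hypothetical heuristic: enable SWAP AB when the query side is small.

    max_rows is a placeholder cutoff; the real threshold would come from benchmarks.
    """
    return q_seq_len * head_group_size <= max_rows


def tile_utilization(q_seq_len: int, head_group_size: int, swap_ab: bool) -> float:
    """Fraction of the padded query dimension that holds real rows."""
    rows = q_seq_len * head_group_size
    # Without SWAP AB the rows sit on M (padded to 64); with it they sit on N (padded to 8).
    granule = MMA_N_STEP if swap_ab else MMA_M_TILE
    padded = math.ceil(rows / granule) * granule
    return rows / padded


if __name__ == "__main__":
    for q_seq_len, head_group_size in [(1, 8), (2, 8), (8, 16)]:
        swap = should_swap_ab(q_seq_len, head_group_size)
        util = tile_utilization(q_seq_len, head_group_size, swap)
        print(f"q_seq_len={q_seq_len} group={head_group_size} swap_ab={swap} util={util:.2f}")
```

With `q_seq_len=2` and `head_group_size=8`, the 16 query rows fill a quarter of a 64-row M tile without the swap, while on the N side they pad to exactly 16.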

The performance results on H20 are shown in the chart below.
![Image](https://github.com/user-attachments/assets/2f0f6724-f6cc-4386-a31a-6efa97b28916)