
Commit da7fb7a

y-sq authored and facebook-github-bot committed
iRoPE (#4063)
Summary: X-link: facebookresearch/FBGEMM#1149

When max_seq_len is larger than 8192, one input sample is divided into multiple sequences. For example, with bs = 2 and seqlen = 7, the prefill attention sees seq_lens = [0, 7, 7, 7, 7, 14, 14, 14, 14]. Decoding is unaffected, since it is handled by the gappy bias.

Reviewed By: sijiac

Differential Revision: D73833204
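As a rough illustration of that layout (a minimal standalone sketch; build_chunked_seq_lens, chunk_size, and chunks_per_sample are hypothetical names, not FBGEMM APIs), the cumulative offsets from the example can be built like this:

#include <algorithm>
#include <cstdint>
#include <iostream>
#include <vector>

// Sketch of the chunked seq_lens layout described above. The helper name,
// chunk_size, and chunks_per_sample are illustrative assumptions, not
// identifiers from the FBGEMM source.
std::vector<int64_t> build_chunked_seq_lens(
    int64_t bs,
    int64_t seqlen,
    int64_t chunk_size,
    int64_t chunks_per_sample) {
  std::vector<int64_t> seq_lens = {0};
  for (int64_t b = 0; b < bs; ++b) {
    for (int64_t c = 0; c < chunks_per_sample; ++c) {
      // Each entry is a cumulative token offset; chunk slots past the end of
      // a sample repeat the previous offset, i.e. they are empty sequences.
      const int64_t end_in_sample = std::min((c + 1) * chunk_size, seqlen);
      seq_lens.push_back(b * seqlen + end_in_sample);
    }
  }
  return seq_lens;
}

int main() {
  // Reproduces the example above: bs = 2, seqlen = 7, four chunk slots per
  // sample -> 0 7 7 7 7 14 14 14 14
  for (const int64_t v : build_chunked_seq_lens(2, 7, 8192, 4)) {
    std::cout << v << ' ';
  }
  std::cout << '\n';
  return 0;
}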
1 parent e8284e2 commit da7fb7a

File tree

1 file changed: +2, -1 lines


fbgemm_gpu/experimental/gen_ai/src/kv_cache/kv_cache.cu

Lines changed: 2 additions & 1 deletion
@@ -3003,7 +3003,8 @@ at::Tensor quantize_qkv_per_head(
   dim3 block_size(kThreadsPerWarp, kWarpsPerBlock);
   dim3 grid_size(cuda_calc_xblock_count(num_warps, kWarpsPerBlock));
 
-  auto scale_q = at::zeros({B, N_KVH_L}, XQ_O.options().dtype(at::kFloat));
+  auto scale_q =
+      at::zeros({cache_K.size(0), N_KVH_L}, XQ_O.options().dtype(at::kFloat));
   float* const scale_q_ptr = scale_q.data_ptr<float>();
   // Launch the kernel
   // TODO: Launch the kernel with B_T * N_H_L blocks only in case of decode.
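One reading of why the allocation changes, per the summary above: once a sample is split into multiple sequences, the leading dimension of cache_K counts sequences rather than input samples, so scale_q needs one row per sequence, not per sample. A minimal sketch with hypothetical sizes (chunks_per_sample is an assumption for illustration, not from the source):

#include <cassert>
#include <cstdint>

int main() {
  // Hypothetical sizes mirroring the summary's example: 2 input samples,
  // each given 4 sequence slots, so the cache batch dimension is 8.
  const int64_t B = 2;                  // input batch size
  const int64_t chunks_per_sample = 4;  // assumption for illustration
  const int64_t cache_batch = B * chunks_per_sample;  // role of cache_K.size(0)

  // A scale_q sized {B, N_KVH_L} would have too few rows for 8 sequences;
  // sizing it by the cache batch dimension covers every sequence.
  assert(cache_batch > B);
  return 0;
}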
