Is it possible to run gather_csr_cuda() without a CPU-GPU sync? My guess is that the sync comes from line 248 of csrc/cuda/segment_csr_cuda.cu, where the last element of indptr is copied to the host with .cpu() to determine the output size:
```
sizes[dim] = indptr.flatten()[-1].cpu().data_ptr<int64_t>()[0];
```
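To make the dependency concrete, here is a minimal CPU-only sketch of the 1-D case. It is not the library's code: `gather_csr_ref` and its `known_size` parameter are hypothetical names used only for illustration. It shows that the output length equals the last value of indptr, so the allocation needs that value on the host unless the caller already supplies the size.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical CPU reference for a 1-D gather over CSR offsets.
// known_size < 0 means "not provided": the length is then read from indptr,
// which is the step that would force a device-to-host copy on the GPU.
std::vector<float> gather_csr_ref(const std::vector<float>& src,
                                  const std::vector<int64_t>& indptr,
                                  int64_t known_size = -1) {
  int64_t n = known_size;
  if (n < 0) {
    // Output length is the last offset (0 for an empty indptr).
    n = indptr.empty() ? 0 : indptr.back();
  }
  std::vector<float> out(static_cast<std::size_t>(n), 0.0f);
  // Segment i covers output positions [indptr[i], indptr[i + 1]).
  for (std::size_t i = 0; i + 1 < indptr.size(); ++i) {
    for (int64_t j = indptr[i]; j < indptr[i + 1] && j < n; ++j) {
      out[static_cast<std::size_t>(j)] = src[i];
    }
  }
  return out;
}
```

If the output size is already known on the host (for example, from the tensor that produced indptr), passing it in, or passing a preallocated output if the function accepts one, would in principle avoid the read. Whether the CUDA path actually skips the `.cpu()` call in that case is an assumption to check against the source.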