Add fp32 distance computation option in NN Descent #1415
base: main
Conversation
We might need to expose fp16 vs. fp32 as an option (defaulting to fp16) instead of deciding based on data dimensions. WIP: investigating.
```cpp
template <typename Index_t, typename Data_t, typename DistEpilogue_t>
__device__ __forceinline__ void calculate_metric(float* s_distances,
```
This is factored out of the original kernel into a separate function because it is shared by the fp32 and fp16 paths.
```cpp
// this is much faster than a warp-collaborative multiplication because MAX_NUM_BI_SAMPLES is
// fixed and small (64)
for (int i = threadIdx.x; i < MAX_NUM_BI_SAMPLES * SKEWED_MAX_NUM_BI_SAMPLES;
     i += blockDim.x) {
  int tmp_row = i / SKEWED_MAX_NUM_BI_SAMPLES;
  int tmp_col = i % SKEWED_MAX_NUM_BI_SAMPLES;
  if (tmp_row < list_new_size && tmp_col < list_new_size) {
    float acc = 0.0f;
    for (int d = 0; d < num_load_elems; d++) {
      acc += s_nv[tmp_row][d] * s_nv[tmp_col][d];
    }
    s_distances[i] += acc;
  }
}
```
This matmul part is different from the fp16 kernel
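For reference, the kernel loop above computes, for each (row, col) pair inside a fixed 64×64 tile, a dot product over the loaded element chunk, accumulated into a skew-padded distance buffer. A host-side C++ sketch of the same accumulation (illustrative only; the +1 skew padding and the flat buffer layout are assumptions about the kernel's shared-memory scheme, not taken from the PR):

```cpp
#include <cassert>
#include <vector>

// Host-side equivalent of the tile accumulation above: for each (row, col)
// pair inside the fixed 64x64 tile, accumulate the dot product of the loaded
// element chunk into the skew-padded distance buffer.
constexpr int MAX_NUM_BI_SAMPLES        = 64;
constexpr int SKEWED_MAX_NUM_BI_SAMPLES = MAX_NUM_BI_SAMPLES + 1;  // assumed +1 padding

void accumulate_tile(std::vector<float>& s_distances,              // size MAX * SKEWED
                     const std::vector<std::vector<float>>& s_nv,  // [list_new_size][num_load_elems]
                     int list_new_size,
                     int num_load_elems) {
  for (int row = 0; row < list_new_size; ++row) {
    for (int col = 0; col < list_new_size; ++col) {
      float acc = 0.0f;
      for (int d = 0; d < num_load_elems; ++d) {
        acc += s_nv[row][d] * s_nv[col][d];
      }
      s_distances[row * SKEWED_MAX_NUM_BI_SAMPLES + col] += acc;
    }
  }
}
```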
divyegala left a comment:
I would like to see a unit test, if possible, that demonstrably shows the distance-computation accuracy improvement of fp32 over fp16.
Thanks for the feedback @divyegala; changing this to target 26.02 for now, since Corey suggested we do further investigation. Force-pushing after rebasing.
Force-pushed from 72bab3e to 5e8600d
Closes #1370
Closes #195
This PR adds an option to use fp32 distance computation.
(Outdated) From heuristics, we chose dim = 16 as the threshold for dispatching to the fp32 distance kernel. We do the computation manually, but since we only target small dimensions, fp32 dispatching ends up slightly faster end to end, with much better recall for small dimensions.
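The (outdated) heuristic above can be sketched as a simple dispatch predicate. All names below are hypothetical; only the dim = 16 threshold comes from the description, and the PR ultimately exposes this as a user-facing option rather than a fixed heuristic:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical dispatch sketch of the (outdated) heuristic: small dimensions
// go to the fp32 distance kernel, larger ones stay on the fp16 kernel.
enum class DistPrecision { FP16, FP32 };

inline DistPrecision pick_distance_precision(std::size_t dim) {
  constexpr std::size_t kFp32DimThreshold = 16;  // threshold from the PR description
  return dim <= kFp32DimThreshold ? DistPrecision::FP32 : DistPrecision::FP16;
}
```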
All numbers below were run on an L40 GPU with an AMD EPYC CPU (128 cores). Perf and recall are averaged over 5 runs; all times are in seconds. The baseline knn graph is computed with sklearn.neighbors.NearestNeighbors (brute-force method).
Max iters=20
For larger dimensions there is an inherent issue with the NN Descent algorithm itself that makes the recall low. This can be improved slightly with more iterations.
Also notice that the end-to-end time is similar or slightly lower when using fp32.
Max iters=100
Notice how in the blue part the recall doesn't improve compared to the table above, even with more iterations (which is why we need the fp32 approach for this part).
Perf impact on different architectures
H100
L40