Place data on CPU memory for nn_descent option in HDBSCAN #7506

@jinsolp

The current implementation of HDBSCAN places the input data in GPU memory at the Python layer, unless `knn_n_clusters > 1`:

def fit(self, X, y=None, *, convert_dtype=True) -> "HDBSCAN":
    """
    Fit HDBSCAN model from features.
    """
    kwds = self.build_kwds or {}
    if kwds.get("knn_n_clusters", 1) > 1:
        logger.warn("Using data on host memory because knn_n_clusters > 1.")
        convert_to_mem_type = MemoryType.host
    else:
        logger.warn("Using data on device memory because knn_n_clusters = 1.")
        convert_to_mem_type = MemoryType.device

NN-Descent always copies the input data to GPU memory, even if the user originally provides it on the GPU. This is required because the algorithm internally converts the data to FP16. As a result, providing GPU-resident data leads to duplicate GPU allocations and unnecessary memory usage.
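To make the duplicate allocation concrete, here is a rough back-of-the-envelope sketch. It assumes FP32 input and counts only the user's array plus the internal FP16 copy; graph buffers and other workspace are ignored. The helper name `gpu_bytes_for_input` is hypothetical and not part of cuML.

```python
def gpu_bytes_for_input(n_rows, n_cols, input_on_device,
                        input_itemsize=4, internal_itemsize=2):
    """Estimate GPU bytes held by the dataset during NN-Descent.

    Assumptions (not taken from the cuML source): the input is FP32
    (4 bytes per element) and the internal copy is FP16 (2 bytes per element).
    """
    n_elems = n_rows * n_cols
    # The internal FP16 copy lives on the GPU regardless of where the input is.
    internal_copy = n_elems * internal_itemsize
    # The user's array only occupies GPU memory if it was passed on device.
    user_copy = n_elems * input_itemsize if input_on_device else 0
    return user_copy + internal_copy


# 1M rows x 128 features
on_device = gpu_bytes_for_input(1_000_000, 128, input_on_device=True)
on_host = gpu_bytes_for_input(1_000_000, 128, input_on_device=False)
print(on_device, on_host)  # 768000000 256000000
```

Under these assumptions, keeping the input on the host cuts the dataset's GPU footprint to a third for this shape.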

Therefore, when using the `nn_descent` option, the input data should be placed in CPU memory to avoid this extra GPU-side copy.
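One possible shape for the change, as a minimal standalone sketch rather than the actual cuML patch: the `MemoryType` enum below is a local stand-in for cuML's type of the same name, and both the `select_mem_type` helper and the `build_algo` argument are assumptions made for illustration.

```python
from enum import Enum


class MemoryType(Enum):
    # Local stand-in for cuML's MemoryType, so the sketch runs without a GPU.
    host = "host"
    device = "device"


def select_mem_type(build_algo, build_kwds=None):
    """Pick where the input should live before the kNN graph is built.

    Hypothetical helper: keeps the existing knn_n_clusters rule and
    additionally sends data to host memory whenever NN-Descent is used.
    """
    kwds = build_kwds or {}
    if build_algo == "nn_descent":
        # NN-Descent makes its own FP16 copy on the GPU, so a device-resident
        # input would only duplicate the data there.
        return MemoryType.host
    if kwds.get("knn_n_clusters", 1) > 1:
        return MemoryType.host
    return MemoryType.device


print(select_mem_type("nn_descent"))
print(select_mem_type("brute_force"))
```

With this rule, the existing `knn_n_clusters > 1` branch is subsumed for NN-Descent, and other build algorithms keep the current behavior.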

Labels

algo: hdbscan, improvement (Improvement / enhancement to an existing function)
