Speed up model scoring/prediction for large datasets #9

HaydenMcT · 2024-04-26T21:56:52Z

The old classify() code calls X.values separately for each sample. Caching the result once before the loop gave an orders-of-magnitude speedup in an experiment we recently ran on the Adult dataset from the UCI Machine Learning Repository.
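To illustrate the pattern, here is a minimal sketch rather than the project's actual classify() code. The function names, the `threshold` parameter, and the single-split "tree" are hypothetical stand-ins. The only point is where `X.values` gets evaluated.

```python
import pandas as pd


def predict_slow(X: pd.DataFrame, threshold: float = 0.5) -> list:
    """Hypothetical per-sample loop that accesses X.values on every iteration."""
    predictions = []
    for i in range(len(X)):
        # X.values is evaluated once per sample; for frames with mixed
        # dtypes pandas may rebuild the whole array on each access.
        row = X.values[i]
        predictions.append(int(row[0] > threshold))
    return predictions


def predict_fast(X: pd.DataFrame, threshold: float = 0.5) -> list:
    """Same logic, with the DataFrame-to-array conversion done once."""
    values = X.values  # converted a single time, before the loop
    predictions = []
    for i in range(values.shape[0]):
        predictions.append(int(values[i][0] > threshold))
    return predictions
```

Both functions return the same predictions. Only the number of `X.values` accesses differs, and that is where the cost grows with dataset size.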

Scikit-learn recently deprecated the `sklearn` name for its package, so the following no longer works:
```
pip install sklearn
```
This commit renames all instances of sklearn -> scikit-learn in build scripts and documentation. Additionally, scikit-learn's GBDT classifier has deprecated the `deviance` value of the `loss` parameter and renamed it to `log_loss`, which we have also fixed.

Signed-off-by: Ilias Karimalis <[email protected]>

Reducing the number of calls to pandas.DataFrame.values results in a speedup of three orders of magnitude for prediction on shallow trees trained on the Adult dataset, based on our experiments.

ilias-karimalis and others added 2 commits March 20, 2024 11:24

reduce number of calls to pandas.DataFrame.values

1845650


HaydenMcT closed this Sep 5, 2024

HaydenMcT deleted the prediction-fix branch September 5, 2024 11:04

HaydenMcT restored the prediction-fix branch September 5, 2024 11:09

HaydenMcT deleted the prediction-fix branch September 5, 2024 11:11
