SKLL 2.5
This is a major new release with dozens of new features, bugfixes, and documentation updates!
⚡️ SKLL 2.5 is backwards incompatible with previous versions of SKLL and might yield different results compared to previous versions even with the same data and same settings. ⚡️
💥 Breaking Changes 💥
- Python 3.6 is no longer officially supported since the latest versions of `pandas` and `numpy` have dropped support for it.
- Older top-level imports have been removed and should now be rewritten as follows (Issue #661, PR #662):
  - `from skll import Learner` ➡️ `from skll.learner import Learner`
  - `from skll import FeatureSet` ➡️ `from skll.data import FeatureSet`
  - `from skll import run_configuration` ➡️ `from skll.experiments import run_configuration`
- The default value for the `class_labels` keyword argument for `Learner.predict()` is now `True` instead of `False`. Therefore, for probabilistic classifiers, this method will now return class labels by default instead of class probabilities. To obtain class probabilities, set `class_labels` to `False` when calling this method (Issue #621, PR #622). See the sketch after this list for an illustration.
- The `filter_features` script now offers more intuitive command line options. Input files must be specified using `-i`/`--input` and output files must be specified using `-o`/`--output`. Additionally, `--inverse` must now be used to invert the filtering command since `-i` is used for input files (Issue #598, PR #660).
- The `MegaMReader` and `MegaMWriter` classes have been removed since `.megam` files are no longer supported by SKLL (Issue #532, PR #557).
- The `param_grids` option in the configuration file is now a list of dictionaries instead of a list of lists of dictionaries, with one dictionary for each learner specified in the `learners` option. Correspondingly, the `param_grid` option in `Learner.train()` and `Learner.cross_validate()` is now a dictionary instead of a list of dictionaries, and the default parameter grids for each learner are also simply dictionaries (Issue #618, PR #619). See the sketch after this list for an example of the new `param_grid` format.
- Running a `learning_curve` task via a configuration file now requires at least 500 examples. Fewer examples will raise a `ValueError`. This behavior can only be overridden when using `Learner.learning_curve()` directly via the API (Issue #624, PR #631).
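To make the new `class_labels` default and the new `param_grid` format concrete, here is a minimal sketch. The file names and the parameter grid are made up for illustration, and the example assumes a probabilistic classifier wrapped in a `Learner`; consult the SKLL documentation for the full signatures.

```python
from skll.data import Reader
from skll.learner import Learner

# Hypothetical training and test files; any format SKLL can read would work.
train_fs = Reader.for_path("train.jsonlines").read()
test_fs = Reader.for_path("test.jsonlines").read()

# `param_grid` is now a single dictionary rather than a list of dictionaries.
learner = Learner("LogisticRegression", probability=True)
learner.train(train_fs,
              param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
              grid_objective="accuracy")

# As of SKLL 2.5, `predict()` returns class labels by default ...
labels = learner.predict(test_fs)

# ... so class probabilities must now be requested explicitly.
probabilities = learner.predict(test_fs, class_labels=False)
```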
💡 New features 💡
- `VotingClassifier` and `VotingRegressor` from scikit-learn are now available for use in SKLL. This was done by adding a new `VotingLearner` class that uses `Learner` instances to represent the underlying estimators (Issue #488, PR #665). See the first sketch after this list.
- SKLL now supports custom, user-defined metrics for both hyperparameter tuning and evaluation (Issue #606, PR #612). See the second sketch after this list.
- The following new built-in classification metrics are now available in SKLL: `f05`, `f05_score_macro`, `f05_score_micro`, `f05_score_weighted`, `jaccard`, `jaccard_macro`, `jaccard_micro`, `jaccard_weighted`, `precision_macro`, `precision_micro`, `precision_weighted`, `recall_macro`, `recall_micro`, and `recall_weighted` (Issues #609 and #610, PRs #607 and #612).
- `scikit-learn` has been updated to 0.24.1 (Issue #653, PR #659).
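Below is a minimal sketch of how a voting learner might be used. The import path `skll.learner.voting`, the constructor arguments, and the file names are assumptions based on the description above rather than verbatim documentation; check the SKLL API docs for the exact signature.

```python
from skll.data import Reader
from skll.learner.voting import VotingLearner  # assumed import path

# Hypothetical training and test sets.
train_fs = Reader.for_path("train.jsonlines").read()
test_fs = Reader.for_path("test.jsonlines").read()

# A hard-voting ensemble whose underlying estimators are regular SKLL Learners.
voter = VotingLearner(["LogisticRegression", "SVC", "MultinomialNB"],
                      voting="hard")
voter.train(train_fs, grid_search=False)

predictions = voter.predict(test_fs)
```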
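A custom metric is simply a Python function that follows the usual scikit-learn convention of taking the true and predicted labels and returning a single score. The module and function names below are made up for illustration; the SKLL documentation describes how to point an experiment at such a file so the metric can be used as a tuning objective or an output metric.

```python
# custom_metrics.py -- a hypothetical module holding a user-defined metric.
from sklearn.metrics import fbeta_score


def f075_macro(y_true, y_pred):
    """Macro-averaged F-beta score with beta = 0.75."""
    return fbeta_score(y_true, y_pred, beta=0.75, average="macro")
```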
🛠 Bugfixes & Improvements 🛠
- Hyperparameter tuning now uses 5-fold cross-validation, instead of 3, to match the change in the default value of the `cv` parameter for `GridSearchCV`. This will marginally increase the time taken for experiments with grid search but should produce more reliable results (Issue #487, PR #667).
- The SKLL codebase now uses sub-packages instead of very long modules, which makes it easier to navigate and understand (Issue #600, PR #601).
- The `log` configuration file option has been renamed to `logs`. Using `log` will still work but will raise a warning. The `log` option will be removed entirely in the next release (Issue #520, PR #670).
- Learning curves are now correctly generated for probabilistic classifiers (Issue #648, PR #649).
- Saving models in the current directory via `Learner.save()` no longer requires adding `./` to the path (Issue #572, PR #604).
- The `filter_features` script no longer automatically assumes labels specified with `-L` or `--label` to be strings (Issue #598, PR #660).
- Remove the `create_label_dict` keyword argument from `Learner.train()` since it did not need to be user-facing (Issue #565, PR #605).
- Do not return 0 from correlation metrics when `NaN` is more appropriate; returning 0 led to incorrect hyperparameter tuning results (Issue #585, PR #588).
- The `Learner._check_input_formatting()` private method now works correctly for dense featuresets (Issue #656, PR #658).
- SKLL conda packages are again platform-specific, and the recipe now uses a `conda_build_config.yaml` to build the Python 3.7, 3.8, and 3.9 variants in one go (Issue #623, PR #XXX).
- Several useful changes to the SKLL code style:
  - Standardize string concatenation (Issue #636, PR #645)
  - Use `with` context managers when opening files (Issue #641, PR #644)
  - Use f-strings where possible (Issue #633, PR #634)
  - Follow standard guidelines for sorting imports (Issue #638, PR #650)
  - Use `pre-commit` hooks to enforce code formatting guidelines during development (Issue #646, PR #650)
📖 Documentation Updates 📖
- Update `CONTRIBUTING.md` with the new sub-package structure of the SKLL codebase (Issue #611, PR #628).
- Add a section to the README that explains how to cite SKLL (Issue #599, PR #672).
- Add an Azure Pipelines badge to the README (Issue #608, PR #672).
- Add an explicit `.readthedocs.yml` file to configure the auto-built documentation (Issue #668, PR #672).
- Make it clear that not specifying the `predictions` configuration file option leads to prediction files being output in the current directory (Issue #664, PR #672).
✔️ Tests ✔️
- The Linux and Windows CI builds now use Python 3.7 and 3.8, respectively, instead of Python 3.6 (Issue #524, PR #665).
- Both the Linux and Windows CI builds now use consistent `nosetests` commands (Issue #584, PR #665).
- `nose-cov` is now automatically installed via `conda_requirements.txt` when setting up a development environment instead of requiring a separate step (Issue #527, PR #672).
- Add comprehensive new tests for voting learners, custom metrics, new built-in metrics, and the new bugfixes.
- Current code coverage for SKLL tests is at 97%, the highest it has ever been!
👩‍🔬 Contributors 👨‍🔬
(Note: This list is sorted alphabetically by last name and not by the quality/quantity of contributions to this release.)
Aoife Cahill (@aoifecahill), Binod Gyawali (@bndgyawali), Nitin Madnani (@desilinguist), Matt Mulholland (@mulhod), Sree Harsha Ramesh (@srhrshr)