
SKLL 2.5

@desilinguist desilinguist released this 26 Feb 03:01
d590ece

This is a major new release with dozens of new features, bugfixes, and documentation updates!

⚡️ SKLL 2.5 is backwards incompatible with previous versions of SKLL and might yield different results compared to previous versions even with the same data and same settings. ⚡️

💥 Breaking Changes 💥

  • Python 3.6 is no longer officially supported since the latest versions of pandas and numpy have dropped support for it.

  • Older top-level imports have been removed and should now be rewritten as follows (Issue #661, PR #662):

    • from skll import Learner ➡️ from skll.learner import Learner
    • from skll import FeatureSet ➡️ from skll.data import FeatureSet
    • from skll import run_configuration ➡️ from skll.experiments import run_configuration

  • The default value for the class_labels keyword argument for Learner.predict() is now True instead of False. Therefore, for probabilistic classifiers, this method will now return class labels by default instead of class probabilities. To obtain class probabilities, set class_labels to False when calling this method (Issue #621, PR #622).

  • The filter_features script now offers more intuitive command line options. Input files must be specified using -i/--input and output files must be specified using -o/--output. Additionally, --inverse must now be used to invert the filtering command since -i is used for input files (Issue #598, PR #660).

  • The MegaMReader and MegaMWriter classes have been removed from SKLL since .megam files are no longer supported by SKLL (Issue #532, PR #557).

  • The param_grids option in the configuration file is now a list of dictionaries, one for each learner specified in the learners option, instead of a list of lists of dictionaries. Correspondingly, the param_grid option in Learner.train() and Learner.cross_validate() is now a dictionary instead of a list of dictionaries, and the default parameter grids for each learner are also simply dictionaries (Issue #618, PR #619).

  • Running a learning_curve task via a configuration file now requires at least 500 examples. Fewer examples will raise a ValueError. This behavior can only be overridden when using Learner.learning_curve() directly via the API (Issue #624, PR #631).
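
For scripts written against the old nested format, the param_grids change listed above could be handled with a small migration helper. The following is a minimal sketch and not part of SKLL: the function name flatten_param_grids is hypothetical, and it assumes each learner's old-style list held exactly one dictionary. It raises an error rather than guessing when a learner had several.

```python
from typing import Any, Dict, List

# A parameter grid maps a hyperparameter name to the values to try.
ParamGrid = Dict[str, List[Any]]


def flatten_param_grids(old_grids: List[List[ParamGrid]]) -> List[ParamGrid]:
    """Convert old-style grids (one list of dicts per learner) into
    new-style grids (one dict per learner).

    Hypothetical helper for illustration only; not part of the SKLL API.
    """
    new_grids = []
    for index, learner_grids in enumerate(old_grids):
        # Several dicts for one learner cannot be merged safely in general,
        # so ask the user to resolve that case by hand.
        if len(learner_grids) != 1:
            raise ValueError(
                f"Learner {index} has {len(learner_grids)} grids; "
                "merge them manually into a single dictionary."
            )
        # Copy so that the caller's original dictionary is not shared.
        new_grids.append(dict(learner_grids[0]))
    return new_grids


# Old style: a list of lists of dictionaries, one inner list per learner.
old_style = [[{"C": [0.01, 0.1, 1.0]}], [{"max_depth": [3, 5, None]}]]

# New style: a list of dictionaries, one dictionary per learner.
new_style = flatten_param_grids(old_style)
```

When calling Learner.train() or Learner.cross_validate() directly, the same idea applies to a single learner: pass one dictionary as param_grid rather than a list containing it.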

💡 New features 💡

  • VotingClassifier and VotingRegressor from scikit-learn are now available for use in SKLL. This was done by adding a new VotingLearner class that uses Learner instances to represent underlying estimators (Issue #488, PR #665).

  • SKLL now supports custom, user-defined metrics for both hyperparameter tuning and evaluation (Issue #606, PR #612).

  • The following new built-in classification metrics are now available in SKLL: f05, f05_score_macro, f05_score_micro, f05_score_weighted, jaccard, jaccard_macro, jaccard_micro, jaccard_weighted, precision_macro, precision_micro, precision_weighted, recall_macro, recall_micro, and recall_weighted (Issues #609 and #610, PRs #607 and #612).

  • scikit-learn has been updated to 0.24.1 (Issue #653, PR #659).
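
To give a feel for what the new f05 family of metrics measures, here is a small sketch, assuming these metrics correspond to the F-beta score with beta = 0.5, which weights precision more heavily than recall. The helper f_beta below is a hypothetical pure-Python illustration for a single positive class, not SKLL's implementation.

```python
def f_beta(y_true, y_pred, beta=0.5, positive=1):
    """Compute the F-beta score for one positive class.

    Hypothetical helper for illustration only; not part of the SKLL API.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # With no true positives, precision and recall are both zero
    # (or undefined), so report 0.0 by convention.
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    beta_sq = beta ** 2
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


# A cautious classifier: its one positive prediction is correct
# (precision 1.0), but it misses three of four positives (recall 0.25).
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0]

f05 = f_beta(y_true, y_pred, beta=0.5)  # rewards the high precision
f1 = f_beta(y_true, y_pred, beta=1.0)   # balances precision and recall
```

On this toy data the beta = 0.5 score comes out higher than the beta = 1 score, which is the kind of behavior to expect when choosing an f05 variant as a tuning objective for precision-sensitive tasks.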

🛠 Bugfixes & Improvements 🛠

  • Hyperparameter tuning now uses 5-fold cross-validation, instead of 3, to match the change in the default value of the cv parameter for GridSearchCV. This will marginally increase the time taken for experiments with grid search but should produce more reliable results (Issue #487, PR #667).

  • The SKLL codebase now uses sub-packages instead of very long modules which makes it easier to navigate and understand (Issue #600, PR #601).

  • The log configuration file option has been renamed to logs. Using log will still work but will raise a warning. The log option will be removed entirely in the next release (Issue #520, PR #670).

  • Learning curves are now correctly generated for probabilistic classifiers (Issue #648, PR #649).

  • Saving models in the current directory via Learner.save() no longer requires adding ./ to the path (Issue #572, PR #604).

  • The filter_features script no longer automatically assumes labels specified with -L or --label to be strings (Issue #598, PR #660).

  • Remove the create_label_dict keyword argument from Learner.train() since it did not need to be user-facing (Issue #565, PR #605).

  • Do not return 0 from correlation metrics when NaN is more appropriate. Doing this resulted in incorrect hyperparameter tuning results (Issue #585, PR #588).

  • The Learner._check_input_formatting() private method now works correctly for dense featuresets (Issue #656, PR #658).

  • SKLL conda packages are again platform-specific and the recipe now uses a conda_build_config.yaml to build the Python 3.7, 3.8, and 3.9 variants in one go (Issue #623, PR #XXX).

  • Several useful changes to the SKLL code style:

    • Standardize string concatenation (Issue #636, PR #645)
    • Use with context manager when opening files (Issue #641, PR #644)
    • Use f-strings where possible (Issue #633, PR #634)
    • Follow standard guidelines for sorting imports (Issue #638, PR #650)
    • Use pre-commit hooks to enforce code formatting guidelines during development (Issue #646, PR #650)
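
The correlation-metric fix in the list above is easiest to see with a constant prediction vector. The sketch below is a hypothetical pure-Python Pearson correlation, not SKLL's code, that returns NaN when either input has zero variance, since the coefficient is undefined in that case. Reporting 0 instead would make a degenerate model look like a merely uncorrelated one during hyperparameter tuning.

```python
import math


def pearson(x, y):
    """Pearson correlation that returns NaN when it is undefined.

    Hypothetical helper for illustration only; not part of the SKLL API.
    """
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("inputs must have equal length of at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    # Zero variance means the correlation is undefined: return NaN, not 0.
    if var_x == 0 or var_y == 0:
        return math.nan
    return cov / math.sqrt(var_x * var_y)


# A model that predicts the same value for every example.
constant = pearson([1, 2, 3, 4], [5, 5, 5, 5])

# A model whose predictions are perfectly linearly related to the labels.
perfect = pearson([1, 2, 3, 4], [2, 4, 6, 8])
```

Code that consumes such a metric should check for NaN with math.isnan() rather than comparing against 0.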

📖 Documentation Updates 📖

  • Update CONTRIBUTING.md with the new sub-package structure of the SKLL codebase (Issue #611, PR #628).

  • Add a section to the README that explains how to cite SKLL (Issue #599, PR #672).

  • Add Azure Pipelines badge to the README (Issue #608, PR #672).

  • Add explicit .readthedocs.yml file to configure the auto-built documentation (Issue #668, PR #672).

  • Make it clear that not specifying the predictions configuration file option leads to prediction files being output in the current directory (Issue #664, PR #672).

✔️ Tests ✔️

  • Reduce code duplication in tests (Issue #635, PR #642).

  • The Linux and Windows CI builds now use Python 3.7 and 3.8 respectively, instead of Python 3.6 (Issue #524, PR #665).

  • Both the Linux and Windows CI builds now use consistent nosetests commands (Issue #584, PR #665).

  • nose-cov is now automatically installed via conda_requirements.txt when setting up a development environment instead of requiring a separate step (Issue #527, PR #672).

  • Add comprehensive new tests for voting learners, custom metrics, new built-in metrics, as well as for new bugfixes.

  • Current code coverage for SKLL tests is at 97%, the highest it has ever been!

👩‍🔬 Contributors 👨‍🔬

(Note: This list is sorted alphabetically by last name and not by the quality/quantity of contributions to this release.)

Aoife Cahill (@aoifecahill), Binod Gyawali (@bndgyawali), Nitin Madnani (@desilinguist), Matt Mulholland (@mulhod), Sree Harsha Ramesh (@srhrshr)