Flow Serialization

One of the key features of OpenML is the ability to serialize flows. Workbench packages such as Weka, scikit-learn and mlr contain many algorithms/classifiers that can be uploaded to OpenML. "Uploading" in this sense really means "registering": no executable code is stored on the server. To download and re-use the algorithm/classifier, one needs the actual workbench package in combination with the meta-data on OpenML.

Nomenclature

In this article, we refer to the classifier as the algorithm that lives in the workbench package, and to the flow as the registered version of this algorithm on the OpenML server.

Design considerations

We aim to create a perfect mapping between the actual instantiation of the classifier/algorithm in the workbench and the registration of the

Any instantiation of a given algorithm from a workbench package should be mapped to the same flow description on OpenML. For example, consider the scikit-learn Random Forest classifier without any additional pipeline or preprocessing components. Every user who uses this classifier through the openml-python package should have their results linked to the same flow on OpenML.
Flows on OpenML should contain all information needed to be re-instantiated on the user's computer, given the correct version of the workbench and the connector package.
Hyperparameter settings are irrelevant at the flow level. Any two (combinations of) algorithms that use the same entity in the workbench but have different hyperparameter settings are considered to be the same flow on OpenML.
Ideally, none of the registered flows have any source or binary files attached (as all information should be available in a condensed format).
A good unit test would consist of the following steps:
1. instantiate a classifier
2. solve a small task with a complex decision boundary
3. upload the classifier to OpenML (not necessarily the run result)
4. download the flow from OpenML and re-instantiate the classifier
5. solve the same small task as in step 2
6. assert that the predictions from the classifier before uploading are exactly the same as the predictions from the re-instantiated classifier
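
The six steps above can be sketched as a self-contained round-trip test. This is only an illustration under stated assumptions, not the OpenML or openml-python API: `NearestNeighbours`, `WORKBENCH`, `FakeFlowServer`, `classifier_to_flow` and `flow_to_classifier` are hypothetical stand-ins for a workbench classifier, the installed workbench package, the OpenML server and the connector package. The sketch also assumes that the flow description records only the entity and the names of its hyperparameters, while the hyperparameter settings travel separately.

```python
import json


class NearestNeighbours:
    """Hypothetical stand-in for a classifier living in a workbench package."""

    def __init__(self, k=1):
        self.k = k

    def get_params(self):
        return {"k": self.k}

    def fit(self, X, y):
        self._X, self._y = list(X), list(y)
        return self

    def predict(self, X):
        predictions = []
        for point in X:
            # Rank training points by squared Euclidean distance to `point`.
            ranked = sorted(
                range(len(self._X)),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(point, self._X[i])),
            )
            votes = [self._y[i] for i in ranked[: self.k]]
            # Majority vote; ties go to the smallest label, so the result is deterministic.
            predictions.append(max(sorted(set(votes)), key=votes.count))
        return predictions


# Hypothetical registry playing the role of the locally installed workbench.
WORKBENCH = {"toy_workbench.NearestNeighbours": NearestNeighbours}


def classifier_to_flow(clf):
    """Describe the classifier by entity name and hyperparameter names only."""
    name = next(key for key, cls in WORKBENCH.items() if cls is type(clf))
    return {"name": name, "parameters": sorted(clf.get_params())}


def flow_to_classifier(flow, settings):
    """Re-instantiate a classifier from a flow description plus its settings."""
    return WORKBENCH[flow["name"]](**settings)


class FakeFlowServer:
    """In-memory stand-in for the server; identical flows share one id."""

    def __init__(self):
        self._flows = {}  # flow id -> JSON text

    def publish(self, flow):
        text = json.dumps(flow, sort_keys=True)
        for flow_id, stored in self._flows.items():
            if stored == text:
                return flow_id
        flow_id = len(self._flows) + 1
        self._flows[flow_id] = text
        return flow_id

    def download(self, flow_id):
        return json.loads(self._flows[flow_id])


def round_trip(server):
    # An XOR-style task: the classes cannot be separated by a single line.
    X_train, y_train = [(0, 0), (0, 1), (1, 0), (1, 1)], [0, 1, 1, 0]
    X_test = [(0.1, 0.2), (0.9, 0.1), (0.2, 0.8), (0.8, 0.9)]

    clf = NearestNeighbours(k=1)                                    # step 1
    before = clf.fit(X_train, y_train).predict(X_test)              # step 2
    flow_id = server.publish(classifier_to_flow(clf))               # step 3
    clone = flow_to_classifier(server.download(flow_id), clf.get_params())  # step 4
    after = clone.fit(X_train, y_train).predict(X_test)             # step 5
    return before, after


before, after = round_trip(FakeFlowServer())
assert before == after                                              # step 6
```

Because the flow description in this sketch leaves out hyperparameter values, `classifier_to_flow(NearestNeighbours(k=1))` and `classifier_to_flow(NearestNeighbours(k=5))` produce the same description, and publishing both to the same `FakeFlowServer` yields a single flow id.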

Current situation

The database consists of the following fields.
