Flow Serialization
One of the key features of OpenML is the ability to serialize flows on OpenML. Workbench packages such as Weka, Scikit-learn and mlR contain many algorithms/classifiers that can be uploaded to OpenML. Uploading in this sense amounts to registering: in order to download and re-use the algorithm/classifier, one needs the actual workbench package in combination with the meta-data stored on OpenML.
In this article, we refer to the classifier as the algorithm that lives in the workbench package, and to the flow as the registered version of this algorithm on the OpenML server.
We aim to create a perfect mapping between the actual instantiation of the classifier/algorithm in the workbench and the registration of the flow on the OpenML server. Concretely:
- Any instantiation of a given algorithm from a workbench package should map to the same flow description on OpenML. For example, consider the Scikit-learn classifier Random Forest without any additional pipeline or preprocessing components. Every user who runs this classifier through the openml-python package should have their results linked to the same flow on OpenML.
- Flows on OpenML should contain all information needed to re-instantiate them on the user's computer, given the correct version of the workbench and the connector package.
- Hyperparameter settings are irrelevant at the flow level. Any two (combinations of) algorithms that use the same entity in the workbench but with different hyperparameter settings are considered the same flow on OpenML.
- Ideally, none of the registered flows have any source or binary files attached, as all information should be available in a condensed format.
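The flow-identity requirement above can be sketched with a toy identity function (all names here are hypothetical, not the openml-python API): two differently-parameterized instances of the same classifier map to the same (name, external version) key.

```python
class DecisionTree:
    """Toy stand-in for a classifier living in a workbench package."""
    def __init__(self, max_depth=None, min_samples=2):
        self.max_depth = max_depth
        self.min_samples = min_samples

def flow_key(clf, package_version="0.1"):
    """Flow identity is (name, external_version). Hyperparameter
    settings are deliberately NOT part of the key."""
    return (type(clf).__name__, package_version)

a = DecisionTree(max_depth=3)
b = DecisionTree(max_depth=10, min_samples=5)
# Different hyperparameters, same flow on OpenML.
assert flow_key(a) == flow_key(b)
```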
- A good unit test would consist of the following steps:
  1. instantiate a classifier
  2. solve a small task with a complex decision boundary
  3. upload the classifier to OpenML (not necessarily the run result)
  4. download the flow from OpenML and re-instantiate the classifier
  5. solve the same small task as in step 2
  6. assert that the predictions of the classifier before uploading are exactly the same as the predictions of the re-instantiated classifier
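The round-trip test above can be sketched in a self-contained way; here the OpenML server is replaced by an in-memory store, the connector's class lookup by a dict, and the classifier by a toy 1-nearest-neighbour model, so every name is hypothetical rather than the real openml-python test suite.

```python
REGISTRY = {}      # stands in for the class lookup in the connector package
FLOW_STORE = {}    # stands in for the OpenML server

class OneNN:
    """Toy 1-nearest-neighbour classifier with one hyperparameter."""
    def __init__(self, power=2):
        self.power = power
    def fit(self, X, y):
        self.X, self.y = X, y
        return self
    def predict(self, X):
        def dist(a, b):
            return sum(abs(ai - bi) ** self.power for ai, bi in zip(a, b))
        return [self.y[min(range(len(self.X)), key=lambda i: dist(x, self.X[i]))]
                for x in X]

REGISTRY["OneNN"] = OneNN

def upload_flow(clf):
    """'Register' the flow: store name and hyperparameters, no binaries."""
    flow_id = len(FLOW_STORE) + 1
    FLOW_STORE[flow_id] = {"name": type(clf).__name__,
                           "params": {"power": clf.power}}
    return flow_id

def download_flow(flow_id):
    """Re-instantiate the classifier from the stored meta-data."""
    flow = FLOW_STORE[flow_id]
    return REGISTRY[flow["name"]](**flow["params"])

# XOR-like task: small, but not linearly separable.
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
y = [0, 1, 1, 0]

clf = OneNN(power=1)                  # 1. instantiate
before = clf.fit(X, y).predict(X)     # 2. solve a small task
flow_id = upload_flow(clf)            # 3. "upload" the classifier
clf2 = download_flow(flow_id)         # 4. download and re-instantiate
after = clf2.fit(X, y).predict(X)     # 5. solve the same task
assert before == after                # 6. predictions must match exactly
```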
The flow database on the OpenML server consists of the following fields:
- flow id (assigned by the OpenML server)
- flow name and external version. This is the main information that determines equality between algorithms; the combination is a unique key, and any algorithms that map to the same flow name / external version combination are considered equal.
- automatically generated meta-data (uploader, upload date, version; assigned by the OpenML server)
- custom name field, for a human-readable name (currently not used)
- free-format meta-data, such as dependencies, installation notes, description, etc.
- attached source and binary files
- parameters
- subflows (a recursive definition: a flow can contain other flows as components)
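The field list above can be summarized as a recursive data structure; this is a hypothetical sketch, not the actual openml-python class.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Flow:
    name: str                            # with external_version: the unique key
    external_version: str
    flow_id: Optional[int] = None        # assigned by the OpenML server
    uploader: Optional[str] = None       # automatically generated meta-data
    custom_name: Optional[str] = None    # human-readable name (currently unused)
    description: str = ""                # free-format meta-data
    source_file: Optional[bytes] = None  # optional attached source/binary
    parameters: dict = field(default_factory=dict)
    subflows: list = field(default_factory=list)  # list of Flow: recursive

bagging = Flow("Bagging", "weka_3.8",
               subflows=[Flow("DecisionTree", "weka_3.8")])
pipeline = Flow("Pipeline", "weka_3.8",
                subflows=[Flow("Imputation", "weka_3.8"), bagging])
assert pipeline.subflows[1].subflows[0].name == "DecisionTree"
```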
There are currently several ways in which connector packages ensure the perfect mapping between client-side algorithms and OpenML flows:
- Each (combination of) algorithms is represented by a canonical name. For example, a pipeline that consists of an imputation component and a bagging classifier with a tree as base-classifier can be represented as Pipeline(Imputation, Bagging(DecisionTree)). The Weka and Scikit-learn connectors use this representation scheme.
- Utilizing a hash of the code. An algorithm is assigned a (not necessarily unique) name, and an (MD5) hash of the code is used as the external version. This way, name clashes on the flow name field are resolved. Since the name alone is not enough to re-instantiate the classifier, a source or binary file is attached. RapidMiner and mlR use this representation scheme.
- The MOA client uses a canonical name for flows that is a unique representation of a given algorithm (like Weka and Scikit-learn). However, it currently lacks the functionality to re-instantiate the flows.
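The two main strategies above can be sketched side by side (the helper functions are hypothetical, not any connector's real API): a recursively built canonical name, and an MD5 hash of the source code used as the external version.

```python
import hashlib

def canonical_name(name, components=()):
    """Strategy 1: recursive canonical name (Weka / Scikit-learn style)."""
    if not components:
        return name
    return "%s(%s)" % (name, ", ".join(components))

tree = canonical_name("DecisionTree")
bagging = canonical_name("Bagging", [tree])
pipeline = canonical_name("Pipeline", [canonical_name("Imputation"), bagging])
assert pipeline == "Pipeline(Imputation, Bagging(DecisionTree))"

def external_version(source_code):
    """Strategy 2: hash of the code (RapidMiner / mlR style). The flow
    name need not be unique; the hash resolves name clashes."""
    return hashlib.md5(source_code.encode()).hexdigest()

v1 = external_version("def predict(x): return 0")
v2 = external_version("def predict(x): return 1")
assert v1 != v2  # different code, different external version
```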
However, the current design has several flaws:
- There is no uniform standard across workbenches.
- It is often hard and unnatural to break down a flow into a hierarchy of subflows.
- The current representation does not allow for component identifiers within a flow. This means that if a flow contains the same component twice (e.g., an imputer for categorical features and an imputer for numerical features), these cannot be referred to unambiguously.
- The name field has length restrictions, which is unacceptable for a field that is supposed to ensure serialization.
- (Not really a flaw, but the design is quite biased towards the structure of Weka; it is harder to adapt to Scikit-learn, mlR or RapidMiner.)
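The missing-component-identifier flaw can be made concrete with a toy flow structure (all names hypothetical): when the same component appears twice, a lookup by name can only ever find the first occurrence.

```python
# Toy flow: (name, list of subflows). Two imputers share one name.
flow = ("Pipeline", [("Imputer", []),       # meant for categorical features
                     ("Imputer", []),       # meant for numerical features
                     ("DecisionTree", [])])

def find_subflow(flow, name):
    """Name-based lookup: returns the index of the FIRST match only."""
    return next((i for i, (n, _) in enumerate(flow[1]) if n == name), None)

# The second imputer (index 1) is unreachable by name alone.
assert find_subflow(flow, "Imputer") == 0
```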