Flow Serialization
One of the key features of OpenML is the ability to serialize flows. Workbench packages such as Weka, Scikit-learn and mlr contain many algorithms/classifiers that can be uploaded to OpenML. Uploading in this sense is closer to "registering": in order to download and re-use the algorithm/classifier, one needs the actual workbench package in combination with the meta-data on OpenML.
In this article, we refer to the classifier as the algorithm that lives in the workbench package, and to the flow as the registered version of this algorithm on the OpenML server.
We aim to create a perfect mapping between the actual instantiation of the classifier/algorithm in the workbench and the registered flow on the OpenML server. Concretely:
- Any instantiation of a given algorithm from a workbench package should be mapped to the same flow description on OpenML. For example, consider the Scikit-learn classifier Random Forest without any additional pipeline or preprocessing components. Every user who uses this classifier within the openml-python package should have their results linked to the same flow on OpenML.
- Flows on OpenML should contain all information to be reinstantiated on the computer of the user, given the correct version of the workbench and the connector package.
- Hyperparameter settings are irrelevant at the flow level. Any two (combinations of) algorithms that utilize the same entity in the workbench but have different hyperparameter settings are considered to be the same flow on OpenML.
- Ideally, none of the registered flows have any source or binary files attached, as all information should be available in a condensed format.
- A good unit test would consist of the following steps (a sketch follows this list):
- instantiate a classifier
- solve a small task with a complex decision boundary
- upload the classifier to OpenML (not necessarily the run result)
- download the flow from OpenML and re-instantiate the classifier
- solve the same small task as in step 2
- assert that the predictions from the classifier before uploading are exactly the same as the predictions from the re-instantiated classifier
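A minimal sketch of such a round-trip test, assuming the openml-python package and its scikit-learn extension (model_to_flow and flow_to_model are the extension's conversion helpers; exact names may differ between versions):

```python
# Round-trip unit test for flow serialization. API names are assumed from
# openml-python's scikit-learn extension and may differ between versions.
import numpy as np
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier

import openml
from openml.extensions.sklearn import SklearnExtension


def test_flow_roundtrip():
    # 1. Instantiate a classifier.
    clf = RandomForestClassifier(random_state=1)

    # 2. Solve a small task with a complex decision boundary.
    X, y = make_moons(n_samples=200, noise=0.3, random_state=1)
    predictions_before = clf.fit(X, y).predict(X)

    # 3. Upload the classifier to OpenML (registering it as a flow).
    extension = SklearnExtension()
    flow = extension.model_to_flow(clf)
    flow.publish()

    # 4. Download the flow from OpenML and re-instantiate the classifier.
    downloaded = openml.flows.get_flow(flow.flow_id)
    clf_rebuilt = extension.flow_to_model(downloaded)

    # 5. Solve the same small task as in step 2.
    predictions_after = clf_rebuilt.fit(X, y).predict(X)

    # 6. Predictions before upload and after re-instantiation must match exactly.
    assert np.array_equal(predictions_before, predictions_after)
```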
Currently, a flow record in the OpenML database consists of the following fields (an illustrative record follows this list):
- flow id (assigned by the OpenML server)
- flow name and external version. The main information that determines equality between algorithms. The combination is a unique key; any algorithms that map to a given flow name and external version combination are considered equal.
- automatically uploaded meta-data (uploader, upload date, version; assigned by the OpenML server)
- custom name field, for a human-readable name (currently not used)
- free-format meta-data, such as dependencies, installation notes, description, etc.
- attached source file and binary file
- parameters
- subflows (recursive definition of flow)
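For illustration, a flow record with these fields might look roughly as follows (a hypothetical sketch; field names and values are invented, not the exact database schema):

```python
# Hypothetical flow record illustrating the fields listed above.
flow_record = {
    "flow_id": 5891,                                    # assigned by the OpenML server
    "name": "Pipeline(Imputation, Bagging(DecisionTree))",
    "external_version": "sklearn==0.19.1",              # unique key together with name
    "uploader": 86,                                     # automated meta-data
    "upload_date": "2018-03-01T12:00:00",
    "custom_name": "Bagged trees with imputation",      # human-readable name (unused)
    "description": "Imputes missing values, then bags decision trees.",
    "dependencies": "scikit-learn>=0.19, numpy>=1.14",  # free-format meta-data
    "source_file": None,                                # attached source and binary files
    "binary_file": None,
    "parameters": [{"name": "n_estimators", "default_value": "10"}],
    "components": [                                     # subflows (recursive definition)
        {"name": "Imputation"},
        {"name": "Bagging", "components": [{"name": "DecisionTree"}]},
    ],
}
```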
There are currently several different ways in which client packages ensure the perfect mapping between client-side algorithms and OpenML flows:
- each (combination of) algorithms is represented by a canonical name. For example, a pipeline that consists of an imputation component and a bagging classifier that contains a tree as base-classifier can be represented as Pipeline(Imputation, Bagging(DecisionTree)). The Weka and Scikit-learn packages use this representation schema.
- utilizing a hash of the code (see the sketch after this list). An algorithm gets assigned a (not necessarily unique) name, and the code is (MD5) hashed and used as external version. This way, name clashes on the flow name field are resolved. However, the name alone is not enough to re-instantiate the classifier, so a source or binary file is attached. RapidMiner and mlr use this representation schema.
- The MOA client uses a canonical name for flows which is a unique representation of a given algorithm (like Weka and Scikit-learn). However, it currently lacks the functionality to re-instantiate the flows.
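A minimal sketch of the hash-based external version scheme, assuming the algorithm's source code is available as a string (the helper name is hypothetical):

```python
# Sketch of the hash-based external_version scheme described above.
import hashlib


def external_version_for(source_code: str) -> str:
    """Hash the algorithm's code so flows whose names clash stay distinguishable."""
    return hashlib.md5(source_code.encode("utf-8")).hexdigest()


# Two different implementations with the same name get different external versions.
print(external_version_for("def fit(X, y): ..."))  # a 32-character hex digest
```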
The current design has several flaws:
- No uniform standard across workbenches
- It is often hard and unnatural to break down a flow into a hierarchy of subflows
- The current representation does not allow for component identifiers within a flow. This means that if a flow contains the same component twice (e.g., an imputer for categorical features and an imputer for numerical features), these can not be referenced unambiguously.
- Space and size restrictions on the name field (which are unacceptable for a field that ensures serialization)
- (Not really a flaw, but this design is quite biased towards the structure of Weka, whereas it is harder to adapt to Scikit-learn, mlr or RapidMiner)
The following fields will be of key importance in the new version (an illustrative sketch follows this list):
- serialization: Currently the name of the flow. This will no longer be used as the name, but rather as information required to re-instantiate the flow. Format specified below.
- external_version: Same as currently. Specifies the version number of the workbench package.
- The custom name field will be used as a human-readable name on the webserver.
- A more precisely defined way to specify which package was used and which packages are required to run the flow.
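For illustration, a flow under these proposed fields might look as follows (a hypothetical sketch; field names and values are invented):

```python
# Hypothetical flow under the proposed fields.
proposed_flow = {
    # information required to re-instantiate the flow (no longer used as the name)
    "serialization": "Pipeline(Imputation, Bagging(DecisionTree))",
    # version number of the workbench package
    "external_version": "sklearn==0.19.1",
    # human-readable name shown on the webserver
    "custom_name": "Bagged trees with imputation",
    # which package was used and which packages are required to run the flow
    "dependencies": ["scikit-learn>=0.19", "numpy>=1.14"],
}
```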
The following fields will be removed:
- (openml) version: useless
- binary file / source file: barely used in practice (mlr/RapidMiner utilize this field, but the information that is stored here should move to the serialization field)
- subflows: although this is a rather nice feature, we have not utilized it in our research; it is hard to comply with this standard, and it makes it hard to build consistent packages.
The naming schema that Weka and Scikit-learn use to represent their algorithms online (e.g., Pipeline(Preprocessing1, Preprocessing2, MetaClassifier(BaseClassifier))) is a successful example of such a serialization (although we can also use the graph format of DARPA, TODO: source for description). One such schema will be selected, and ideally all workbenches that can comply with it will do so. Keras, RapidMiner and other workbench packages that do not easily fall into such a schema can use their internal schema, although it will be hard to compare flows across packages.
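A minimal sketch of this canonical-name serialization, assuming each flow is modelled as a name plus an ordered list of subflows (the Flow class here is hypothetical, not the openml-python one):

```python
# Sketch of canonical-name serialization for nested flows.
from dataclasses import dataclass, field
from typing import List


@dataclass
class Flow:
    name: str
    components: List["Flow"] = field(default_factory=list)


def serialize(flow: Flow) -> str:
    """Render a flow as Name(Sub1, Sub2(...)) in a canonical, order-preserving way."""
    if not flow.components:
        return flow.name
    inner = ", ".join(serialize(c) for c in flow.components)
    return f"{flow.name}({inner})"


pipeline = Flow("Pipeline", [
    Flow("Imputation"),
    Flow("Bagging", [Flow("DecisionTree")]),
])
assert serialize(pipeline) == "Pipeline(Imputation, Bagging(DecisionTree))"
```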
- It seems that the scikit-learn interface can be adapted rather easily to comply with this standard. An important open question is whether this is also the case for Weka, mlr, MOA and RapidMiner.
- Weka: how to handle parameters of subflows?
- connectability with Docker. This would allow a uniform interface to rerun the models on the server.
- a way to check whether uploaded models comply with server standards.