This repository contains the implementation of the framework proposed in: HAMLET: a framework for Human-centered AutoML via Structured Argumentation.
If you are interested in reproducing the experiments performed in it, please refer to the dedicated GitHub repository.
In the last decades, we have witnessed an exponential growth in the field of AutoML. Unfortunately, this has led AutoML to become just another black box that can barely be opened. We claim that the user / data scientist has the right and the duty to revise and supervise the Machine Learning (ML) / Data Mining (DM) process.
HAMLET is a Human-centered AutoML framework that leverages Logic and Argumentation to:
- inject knowledge and constraints through an intuitive logical language into a Logical Knowledge Base (LogicalKB);
- represent (user) inputs and (AutoML) outputs in a uniform medium that is both human- and machine-readable (Problem Graph);
- discover insights through a recommendation mechanism atop the AutoML outputs;
- deal with inconsistencies that may arise in the knowledge.
HAMLET is inspired by the well-known and standard process model CRISP-DM (CRoss-Industry Standard Process for Data Mining). Iteration after iteration, the knowledge is augmented, and data scientists and domain experts can work in close cooperation towards the final solution.
HAMLET leverages:
- Arg-tuProlog, a Kotlin implementation of the Argumentation framework;
- Microsoft FLAML, a Python implementation of the state-of-the-art Blend Search (a Bayesian Optimization variant that takes into consideration the cost of the possible solutions);
- Scikit-learn, the well-known Python framework that offers a wide range of ML algorithm implementations.
- Docker
- Java >= 11.0
java -jar hamlet-1.0.0-all.jar [workspace_path] [dataset_id] [optimization_metric] [optimization_mode] [n_configurations] [time_budget] [optimization_seed] [debug_mode] [knowledge_base_path]
- [workspace_path]: absolute path to the file system folder containing the workspace (i.e., where to save the results); if it does not exist, a new workspace is created; otherwise, the previous run is resumed.
- [dataset_id]: OpenML id of the dataset to analyze (due to some OpenML API service disruptions, we pre-downloaded the datasets; thus, we support only a specific suite: the OpenML CC-18 suite).
- [optimization_metric]: the name of the metric to optimize (choose among the scikit-learn metrics, e.g., balanced-accuracy).
- [optimization_mode]: a string in ['min', 'max'] to specify whether the objective is minimized or maximized.
- [n_configurations]: an integer specifying the number of configurations to try in each optimization iteration.
- [time_budget]: the time budget (in seconds) given to each optimization iteration.
- [optimization_seed]: seed for reproducibility.
- [debug_mode]: a string in ['true', 'false'] to specify HAMLET execution in debug or release mode. In debug mode, the Docker container is built from the local sources; otherwise the released Docker image is downloaded.
- [knowledge_base_path] (OPTIONAL): file system path to a HAMLET knowledge base. If provided, HAMLET runs in console mode (with no GUI) and the provided theory is leveraged; otherwise, the HAMLET GUI is launched.
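For instance, a release-mode run that launches the GUI (no knowledge base path) could look like the sketch below; the workspace path, dataset id, and numeric values are placeholders to adapt to your setup:

```
java -jar hamlet-1.0.0-all.jar /home/user/hamlet-workspace 40983 balanced-accuracy max 25 300 42 false
```

Replace the dataset id with the id of any dataset belonging to the supported OpenML CC-18 suite.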
Once you run HAMLET with a specific dataset and a specific metric to optimize, at the top of the window, HAMLET allows you to encode both the AutoML search space and the user-defined constraints into the LogicalKB (see the next section for the accepted syntax).
Two ready-to-use LogicalKBs can be found in the resources folder of this repository:
- `kb.txt` is a knowledge base containing the search space leveraged in our experiments;
- `pkb.txt` (PreliminaryKB) is a knowledge base containing the search space along with some suggested constraints (discovered in the paper Data pre-processing pipeline generation for AutoETL).
For the sake of brevity, what follows is an example with a simpler LogicalKB:
Intuitively, we specify the scheme of the ML pipeline that we want to build, step by step. In the example at hand, we have:
- a Data Pre-processing step for Discretization;
- a Data Pre-processing step for Normalization;
- a Modeling step for Classification (the task we want to address).
Then, we have the implementations and the hyper-parameter domains of each step:
- KBins for Discretization, with an integer parameter k_bins that ranges from 3 to 8;
- StandardScaler for Normalization, with no parameters;
- Decision Tree and K-Nearest Neighbors for Classification, with -- respectively -- an integer parameter max_depth that ranges from 1 to 5 and an integer parameter n_neighbors that ranges from 3 to 20.
Finally, we have a user-defined constraint (c1): forbid Normalization for Decision Tree (see the sketch below).
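Following the syntax documented in the next section, this example could be encoded roughly as in the following sketch; note that the classification operator names (`dtree`, `knn`) are illustrative placeholders and must match the identifiers actually used in your LogicalKB:

```
step(discretization).
step(normalization).
step(classification).

operator(discretization, kbins).
operator(normalization, standard).
operator(classification, dtree).
operator(classification, knn).

hyperparameter(kbins, k_bins, randint).
domain(kbins, k_bins, [3, 8]).
hyperparameter(dtree, max_depth, randint).
domain(dtree, max_depth, [1, 5]).
hyperparameter(knn, n_neighbors, randint).
domain(knn, n_neighbors, [3, 20]).

c1 :=> forbidden([normalization], dtree).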
By hitting the Compute Graph button, the Argumentation framework is called to process the encoded LogicalKB.
The Problem Graph is visualized at the bottom-right corner.
Each node of this argumentation graph (called an argument) represents a specific portion of the search sub-space; the legend is visualized at the bottom-left corner.
For instance:
- A1, A3, A5, A7, and A9 represent all the possible pipelines for the Decision Tree algorithm;
- A2, A4, A6, A8, and A10 represent all the possible pipelines for the K-Nearest Neighbor algorithm.
Besides, each constraint is represented as an argument as well.
Indeed, the node (argument) A0 represents the user-defined constraint c1.
Edges are attacks from one argument to another (c1 attacks exactly the pipelines in which Normalization appears along with the Decision Tree).
By hitting the Run AutoML button, HAMLET triggers FLAML to explore the encoded search space, also taking into consideration the specified constraints (discouraging the exploration of those particular sub-spaces).
At the end of the optimization, the user can switch to the Data tab to go through all the explored configurations:
As to the last tab, AutoML arguments, we can see recommendations of constraints mined from the AutoML output:
We think of this process as an argument between the data scientist and the AutoML tool. The data scientist can consider the arguments at hand and encode them into the LogicalKB.
At this point, the next iteration can be performed.
We are committed to developing a logical language that is as intuitive as possible:
- `step(S).` specifies a step `S` of the pipeline, with `S` in [`discretization`, `normalization`, `rebalancing`, `imputation`, `features`, `classification`];
- `operator(S, O).` specifies an operator `O` for the step `S`, with `O` in [`kbins`, `binarizer`, `power_transformer`, `robust_scaler`, `standard`, `minmax`, `select_k_best`, `pca`, `simple_imputer`, `iterative_imputer`, `near_miss`, `smote`];
- `hyperparameter(O, H, T).` specifies a hyper-parameter `H` for the operator `O` with type `T`; `H` can be any hyper-parameter name of the chosen Scikit-learn operator `O`, and `T` is chosen accordingly and has to be in [`randint`, `choice`, `uniform`];
- `domain(O, H, D).` specifies the domain `D` of the hyper-parameter `H` of the operator `O`; `D` is an array in `[ ... ]` brackets containing the values that the hyper-parameter `H` can assume (in the case of `randint` and `uniform`, the array has to contain just two elements: the boundaries of the range);
- `id :=> mandatory_order([S1, S2], O1).` specifies a `mandatory_order` constraint: the step `S1` has to appear before the step `S2` when the operator `O1` of the task step occurs (in this implementation we support only the `classification` task); it is possible to put `classification` instead of `O1`, which applies the constraint to every `classification` operator;
- `id :=> mandatory([S1, S2, ...], O1).` specifies a `mandatory` constraint: the steps `[S1, S2, ...]` are mandatory when the operator `O1` of the task step occurs (in this implementation we support only the `classification` task); if the array of steps is empty, the constraint specifies only that `O1` is mandatory (with or without Data Pre-processing steps);
- `id :=> forbidden([S1, S2, ...], O1).` specifies a `forbidden` constraint: the steps `[S1, S2, ...]` are forbidden when the operator `O1` of the task step occurs (in this implementation we support only the `classification` task); if the array of steps is empty, the constraint specifies only that `O1` is forbidden (with or without Data Pre-processing steps).
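To make the constraint predicates concrete, here is a small sketch that uses only step names from the list above; the ids (c1, c2, c3) are arbitrary, and these constraints are illustrative rather than taken from `kb.txt` or `pkb.txt`:

```
c1 :=> mandatory_order([discretization, normalization], classification).
c2 :=> mandatory([imputation], classification).
c3 :=> forbidden([rebalancing], classification).
```

Here, c1 forces Discretization to appear before Normalization for every classification operator, c2 makes Imputation mandatory for every classifier, and c3 forbids Rebalancing for every classifier.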



