Description
In #85, we created the `RFNNRegressor` estimator that uses a suite of random forests (one for each attribute in `y` (or `y_fit`)) to support a kNN approach to match targets to reference data based on similarity in the node IDs across forests and their trees. In that PR, we relied exclusively on `sklearn.ensemble.RandomForestRegressor` to build the individual forests, assuming that all `y` features were continuous rather than categorical. We decided to defer the implementation of using both `sklearn.ensemble.RandomForestRegressor` and `sklearn.ensemble.RandomForestClassifier` to build the forests for a later time. Per this comment:
> The other obvious thing I need to do to make this estimator more useful is to have the set of random forests be both `RandomForestClassifier` and `RandomForestRegressor` objects. The `RandomForestClassifier` forests could either be built from classifying continuous `y` attributes into classes or from `y` features that already represent classes, either numeric or strings. In my mind, this seems like we need to both: 1) tell the estimator how to build each forest (some kind of mapper between `y` attribute and random forest type); and 2) have utility functions to bin continuous data into categorical data. I can definitely see the argument that 2) should be outside the scope of the estimator such that the user does this on their own, but it might be nice to have example utility functions to show how this is done.
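For concreteness, here is a minimal sketch of the mixed-forest idea described above. This is not the actual `RFNNRegressor` internals; the column names and forest sizes are invented for illustration:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 4))

# Two hypothetical y attributes: one continuous, one already categorical
y_cont = rng.normal(size=100)         # continuous -> regression forest
y_cat = rng.integers(0, 3, size=100)  # class labels -> classification forest

# One forest per y attribute, with the estimator type chosen per attribute
forests = {
    "var_cont": RandomForestRegressor(n_estimators=10, random_state=0).fit(X, y_cont),
    "var_cat": RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y_cat),
}

# `apply` returns the leaf node ID of each sample in every tree; similarity
# in these node IDs across forests and trees is what drives the kNN matching.
node_ids = {name: est.apply(X) for name, est in forests.items()}
print(node_ids["var_cont"].shape)  # (n_samples, n_estimators) -> (100, 10)
```

Both estimator types expose the same `apply` interface, so the node-ID matching step should not care which forest type produced each column of IDs.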
Design considerations
- We need to decide which feature dtypes we will support with `RFNNRegressor`. Ideally, we would be able to support `np.object` (string) and `pd.Categorical` features as well as integer and floating-point features (I suppose we could also support `np.bool`, but building a forest with just two classes seems pretty limited). Both pandas dataframes and numpy arrays can handle mixed data types, so passing a `y` frame or array should be possible. (Note that, at present, our test datasets only contain floating-point features - this isn't a show-stopper, but we'll need to generate mixed-type dataframes if we want to support other data types.)
- Automatically inferring the type of RF estimator to associate with each `y` attribute will probably be the trickiest part of this implementation. I propose that we associate certain data types with specific RF estimators (with allowable user overrides). `np.object` and `pd.Categorical` features would naturally use `RandomForestClassifier` estimators, whereas floating-point features would naturally use `RandomForestRegressor` estimators (although see the point below about data binning). Integer types are a bit trickier, in that they can be nominal or ordinal. ChatGPT suggests that we could use heuristics to handle integer features (e.g. if the number of unique values is >= 10% of the total number of samples, assume regression). Of course, user specification/override will be a valuable tool to have for integer features.
- In terms of user specification of which forest type to use, I had originally thought that we could provide a mapping of column name to estimator type that would be passed to the initializer, something like: `est = RFNNRegressor(rf_mapper={"var_0": "classification", "var_1": "regression"}).fit(X, y)`. But because we need to support both numpy arrays and pandas dataframes for `fit`, we can't consistently rely on column names being available. We would likely have to support index-based specification as well to support numpy arrays. (It does feel a bit strange specifying information about the `y` data before the model is actually fit (i.e. in the `RFNNRegressor` initialization), although `ColumnTransformer` does use this pattern.)
- In `yaImpute`, the default behavior for building forests from floating-point features is to bin the feature into discrete classes rather than running random forest in regression mode. Do we want to provide the same functionality with a hyperparameter keyword (e.g. `build_classes=True`) that would implement this same behavior, or should we just provide examples for users to pre-bin their data before setting up the estimator?
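Putting the dtype heuristic and the user override together, one possible shape for the column-to-forest-type resolution is sketched below. Everything here is an assumption, not settled API: `rf_mapper`, `infer_forest_type`, `resolve_forest_types`, the `"classification"`/`"regression"` labels, and the 10% unique-value threshold are all illustrative:

```python
import numpy as np
import pandas as pd

def infer_forest_type(values, max_unique_frac=0.1):
    """Heuristic sketch (assumed, not settled): map a y column to a forest type."""
    s = pd.Series(values)
    if isinstance(s.dtype, pd.CategoricalDtype) or s.dtype == object:
        return "classification"
    if pd.api.types.is_bool_dtype(s):
        return "classification"
    if pd.api.types.is_float_dtype(s):
        return "regression"
    if pd.api.types.is_integer_dtype(s):
        # Integers are ambiguous (nominal vs. ordinal): assume regression when
        # the number of unique values is >= max_unique_frac of the sample count.
        return "regression" if s.nunique() / len(s) >= max_unique_frac else "classification"
    raise TypeError(f"Unsupported dtype: {s.dtype}")

def resolve_forest_types(y, rf_mapper=None):
    """Resolve a forest type per y column: user override first, heuristic fallback.
    DataFrame columns are keyed by name, ndarray columns by positional index."""
    rf_mapper = rf_mapper or {}
    if isinstance(y, pd.DataFrame):
        items = [(name, y[name]) for name in y.columns]
    else:
        y = np.asarray(y)
        items = [(i, y[:, i]) for i in range(y.shape[1])]
    return {key: rf_mapper.get(key, infer_forest_type(col)) for key, col in items}

# DataFrame: keyed by column name, with a user override for var_1
y_df = pd.DataFrame({"var_0": ["a", "b", "a", "c"], "var_1": [0.1, 0.2, 0.3, 0.4]})
print(resolve_forest_types(y_df, rf_mapper={"var_1": "classification"}))
# {'var_0': 'classification', 'var_1': 'classification'}

# ndarray: keyed by positional index, heuristic only
y_arr = np.arange(40.0).reshape(20, 2)
print(resolve_forest_types(y_arr))
# {0: 'regression', 1: 'regression'}
```

Resolving everything to a plain per-column dict keyed by name or index would let the rest of the fitting code stay agnostic to whether `y` arrived as a dataframe or an array.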
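If we instead leave binning to the user, the yaImpute-style default can be reproduced outside the estimator with standard tools. One sketch using scikit-learn's `KBinsDiscretizer` (the choice of 5 quantile bins is arbitrary, not a recommendation):

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

rng = np.random.default_rng(0)
y = rng.normal(size=(200, 1))  # a continuous y feature

# Bin into 5 ordinal classes by quantile so each class is roughly equally
# populated; the binned labels could then feed a RandomForestClassifier.
binner = KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="quantile")
y_binned = binner.fit_transform(y)

print(sorted(np.unique(y_binned)))  # [0.0, 1.0, 2.0, 3.0, 4.0]
```

`pd.cut`/`pd.qcut` offer a similar route when `y` is a dataframe, which may be a better fit for an example utility function aimed at users who pre-bin their own data.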
@aazuspan, there are lots of unresolved questions at this point, and I would value your help thinking through the knotty data type questions before we start implementing this. I think working with dataframes will be fairly straightforward by setting up the forests based on column names, but type discovery and variable mapping with numpy arrays seems trickier to me.