RFNNRegressor should support both classifier and regressor RFs when building forests #87

@grovduck

Description

In #85, we created the RFNNRegressor estimator that uses a suite of random forests (one for each attribute in y, or y_fit) to support a kNN approach that matches targets to reference data based on similarity of node IDs across forests and their trees. In that PR, we relied exclusively on sklearn.ensemble.RandomForestRegressor to build the individual forests, assuming that all y features were continuous rather than categorical. We decided to defer support for building the forests with both sklearn.ensemble.RandomForestRegressor and sklearn.ensemble.RandomForestClassifier until later. Per this comment:

The other obvious thing I need to do to make this estimator more useful is to have the set of random forests to be both RandomForestClassifier and RandomForestRegressor objects. The RandomForestClassifier forests could either be built from classifying continuous y attributes into classes or from y features that already represent classes, either numeric or strings. In my mind, this seems like we need to both: 1) tell the estimator how to build each forest (some kind of mapper between y attribute and random forest type); and 2) have utility functions to bin continuous data into categorical data. I can definitely see the argument that 2. should be outside the scope of the estimator such that the user does this on their own, but it might be nice to have example utility functions to show how this is done.

Design considerations

  • We need to decide which feature dtypes RFNNRegressor will support. Ideally, we would be able to support np.object (string) and pd.Categorical features as well as integer and floating-point features (I suppose we could also support np.bool, but building a forest with just two classes seems pretty limited). Both pandas dataframes and numpy arrays can handle mixed data types, so passing a y frame or array should be possible. (Note that, at present, our test datasets only contain floating-point features - this isn't a show-stopper, but we'll need to generate mixed-type dataframes if we want to support other data types).
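As an illustration of the mixed-type case, here is a sketch of a hypothetical y frame covering the dtypes above (the column names and value ranges are invented for this example, not taken from our test datasets):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 100

# Hypothetical mixed-type y frame: one float, one integer, one string
# (object), and one pandas Categorical column.
y = pd.DataFrame(
    {
        "biomass": rng.normal(200.0, 50.0, n),            # float -> regression
        "stand_age": rng.integers(0, 150, n),             # int -> ambiguous
        "cover_type": rng.choice(["DF", "WH", "RA"], n),  # string -> classification
        "size_class": pd.Categorical(
            rng.choice(["small", "medium", "large"], n)
        ),                                                # categorical -> classification
    }
)
print(y.dtypes)
```

A frame like this would let us exercise each branch of the dtype-to-estimator logic in one fit call.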

  • Automatically inferring the type of RF estimator to associate with each y attribute will probably be the trickiest part of this implementation. I propose that we associate certain data types with specific RF estimators (with allowable user overrides). np.object and pd.Categorical features would naturally use RandomForestClassifier estimators, whereas floating-point features would naturally use RandomForestRegressor estimators (although see the point below about data binning). Integer types are a bit trickier, in that they can be nominal or ordinal. ChatGPT suggests that we could use heuristics to handle integer features (e.g. if the number of unique values is >= 10% of the number of total samples, assume regression). Of course, user specification/override will be a valuable tool to have for integer features.
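To make the heuristic concrete, here is a sketch of what dtype-based inference could look like (the function name and the 10% threshold default are hypothetical, not settled API):

```python
import pandas as pd


def infer_forest_type(col: pd.Series, unique_frac: float = 0.1) -> str:
    """Hypothetical heuristic for choosing a forest type per y column."""
    # Strings, categoricals, and booleans map naturally to classification.
    if pd.api.types.is_bool_dtype(col) or not pd.api.types.is_numeric_dtype(col):
        return "classification"
    # Floating-point columns default to regression (pre-binning aside).
    if pd.api.types.is_float_dtype(col):
        return "regression"
    # Integers are ambiguous: apply the >=10%-unique-values heuristic,
    # with the expectation that an explicit user override always wins.
    if col.nunique() >= unique_frac * len(col):
        return "regression"
    return "classification"
```

For numpy array input, each column would be wrapped in a Series first; the same rules then apply to whatever dtype the array carries.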

  • In terms of user specification of which forest type to use, I had originally thought that we could provide a mapping of column name to estimator type that would be passed to the initializer, something like:

    est = RFNNRegressor(rf_mapper={"var_0": "classification", "var_1": "regression"}).fit(X, y)

    But because we need to support both numpy arrays and pandas dataframes for fit, we can't consistently rely on column names being available; we would likely have to support index-based specification as well for numpy arrays. (It does feel a bit strange to specify information about the y data before the model is actually fit, i.e. in the RFNNRegressor initializer, although ColumnTransformer does use this pattern.)
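One way to reconcile the two input types is to resolve the mapper against either column names (dataframes) or positional indices (arrays), falling back to regression for unmapped columns. A rough sketch, where resolve_rf_mapper is a hypothetical helper rather than proposed API:

```python
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor


def resolve_rf_mapper(y, rf_mapper):
    """Hypothetical resolver: rf_mapper keys may be column names
    (DataFrame input) or integer positions (ndarray input).

    Returns one unfitted forest per y column, defaulting to regression
    for columns not present in the mapper.
    """
    columns = list(y.columns) if hasattr(y, "columns") else list(range(y.shape[1]))
    forests = {}
    for i, col in enumerate(columns):
        # Look up by name first, then by position, then fall back.
        kind = rf_mapper.get(col, rf_mapper.get(i, "regression"))
        cls = RandomForestClassifier if kind == "classification" else RandomForestRegressor
        forests[col] = cls()
    return forests
```

This keeps the mapper shape identical for both input types, at the cost of the same "configure before fit" awkwardness that ColumnTransformer accepts.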

  • In yaImpute, the default behavior for building forests with floating-point features is to bin the feature into discrete classes rather than running random forest in regression mode. Do we want to provide the same functionality through a hyperparameter keyword (e.g. build_classes=True) that implements this behavior, or should we just provide examples for users to pre-bin their data before setting up the estimator?
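If we go the pre-binning route, the example could be as simple as running scikit-learn's KBinsDiscretizer over the continuous y columns before fitting (the bin count and quantile strategy below are arbitrary choices for illustration, not a recommendation):

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Two continuous y columns of fake data to stand in for real attributes.
y = np.random.default_rng(0).normal(size=(100, 2))

# Discretize each column into 5 quantile-based classes, yielding integer
# class labels suitable for RandomForestClassifier.
binner = KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="quantile")
y_classes = binner.fit_transform(y).astype(int)
```

A documented snippet like this may be all we need if binning stays outside the estimator's scope.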

@aazuspan, lots of unresolved questions at this point and I would value your help thinking through the knotty data type questions before we start implementing this. I think working with dataframes will be fairly straightforward by setting up the forests based on column names, but type discovery and variable mapping with numpy arrays seems trickier to me.

Metadata

Labels

enhancement (New feature or request), estimator (Related to one or more estimators)
