Description
In #85, we created the `RFNNRegressor` estimator that uses a suite of random forests (one for each attribute in `y` (or `y_fit`)) to support a kNN approach to match targets to reference data based on similarity in the node IDs across forests and their trees. In that PR, we relied exclusively on `sklearn.ensemble.RandomForestRegressor` to build the individual forests, assuming that all `y` features were continuous rather than categorical. We decided to defer the implementation of using both `sklearn.ensemble.RandomForestRegressor` and `sklearn.ensemble.RandomForestClassifier` to build the forests for a later time. Per this comment:
> The other obvious thing I need to do to make this estimator more useful is to have the set of random forests be both `RandomForestClassifier` and `RandomForestRegressor` objects. The `RandomForestClassifier` forests could either be built from classifying continuous `y` attributes into classes or from `y` features that already represent classes, either numeric or strings. In my mind, this seems like we need to both: 1) tell the estimator how to build each forest (some kind of mapper between `y` attribute and random forest type); and 2) have utility functions to bin continuous data into categorical data. I can definitely see the argument that 2) should be outside the scope of the estimator such that the user does this on their own, but it might be nice to have example utility functions to show how this is done.
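For concreteness, here is a minimal sketch of the mixed-forest idea described above. This is not the actual `RFNNRegressor` internals; the column names and forest sizes are invented for illustration:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 4))

# Two hypothetical y attributes: one continuous, one already categorical
y_cont = rng.normal(size=100)         # continuous -> regression forest
y_cat = rng.integers(0, 3, size=100)  # class labels -> classification forest

# One forest per y attribute, with the estimator type chosen per attribute
forests = {
    "var_cont": RandomForestRegressor(n_estimators=10, random_state=0).fit(X, y_cont),
    "var_cat": RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y_cat),
}

# `apply` returns the leaf node ID of each sample in every tree; similarity
# in these node IDs across forests and trees is what drives the kNN matching.
node_ids = {name: est.apply(X) for name, est in forests.items()}
print(node_ids["var_cont"].shape)  # (n_samples, n_estimators) -> (100, 10)
```

Both estimator types expose the same `apply` interface, so the node-ID matching step should not care which forest type produced each column of IDs.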
Design considerations
- We need to decide which feature dtypes we will support with `RFNNRegressor`. Ideally, we would be able to support `np.object` (string) and `pd.Categorical` features as well as integer and floating-point features (I suppose we could also support `np.bool`, but building a forest with just two classes seems pretty limited). Both pandas dataframes and numpy arrays can handle mixed data types, so passing a `y` frame or array should be possible. (Note that, at present, our test datasets only contain floating-point features - this isn't a show-stopper, but we'll need to generate mixed-type dataframes if we want to support other data types.)
- Automatically inferring the type of RF estimator to associate with each `y` attribute will probably be the trickiest part of this implementation. I propose that we associate certain data types with specific RF estimators (with allowable user overrides). `np.object` and `pd.Categorical` features would naturally use `RandomForestClassifier` estimators, whereas floating-point features would naturally use `RandomForestRegressor` estimators (although see the point below about data binning). Integer types are a bit trickier, in that they can be nominal or ordinal. ChatGPT suggests that we could use heuristics to handle integer features (e.g. if the number of unique values is >= 10% of the total number of samples, assume regression). Of course, user specification/override will be a valuable tool to have for integer features.
- In terms of user specification of which forest type to use, I had originally thought that we could provide a mapping of column name to estimator type that would be passed to the initializer, something like: `est = RFNNRegressor(rf_mapper={"var_0": "classification", "var_1": "regression"}).fit(X, y)`. But because we need to support both numpy arrays and pandas dataframes for `fit`, we can't consistently rely on column names being available. We would likely have to support index-based specification as well to support numpy arrays. (It does feel a bit strange specifying information about the `y` data before the model is actually fit (i.e. in the `RFNNRegressor` initialization), although `ColumnTransformer` does use this pattern.)
- In `yaImpute`, the default behavior for building forests from floating-point features is to bin the feature into discrete classes rather than running random forest in regression mode. Do we want to provide the same functionality with a hyperparameter keyword (e.g. `build_classes=True`) that would implement this same behavior, or should we just provide examples for users to pre-bin their data before setting up the estimator?
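Putting the dtype heuristic and the user override together, one possible shape for the column-to-forest-type resolution is sketched below. Everything here is an assumption, not settled API: `rf_mapper`, `infer_forest_type`, `resolve_forest_types`, the `"classification"`/`"regression"` labels, and the 10% unique-value threshold are all illustrative:

```python
import numpy as np
import pandas as pd

def infer_forest_type(values, max_unique_frac=0.1):
    """Heuristic sketch (assumed, not settled): map a y column to a forest type."""
    s = pd.Series(values)
    if isinstance(s.dtype, pd.CategoricalDtype) or s.dtype == object:
        return "classification"
    if pd.api.types.is_bool_dtype(s):
        return "classification"
    if pd.api.types.is_float_dtype(s):
        return "regression"
    if pd.api.types.is_integer_dtype(s):
        # Integers are ambiguous (nominal vs. ordinal): assume regression when
        # the number of unique values is >= max_unique_frac of the sample count.
        return "regression" if s.nunique() / len(s) >= max_unique_frac else "classification"
    raise TypeError(f"Unsupported dtype: {s.dtype}")

def resolve_forest_types(y, rf_mapper=None):
    """Resolve a forest type per y column: user override first, heuristic fallback.
    DataFrame columns are keyed by name, ndarray columns by positional index."""
    rf_mapper = rf_mapper or {}
    if isinstance(y, pd.DataFrame):
        items = [(name, y[name]) for name in y.columns]
    else:
        y = np.asarray(y)
        items = [(i, y[:, i]) for i in range(y.shape[1])]
    return {key: rf_mapper.get(key, infer_forest_type(col)) for key, col in items}

# DataFrame: keyed by column name, with a user override for var_1
y_df = pd.DataFrame({"var_0": ["a", "b", "a", "c"], "var_1": [0.1, 0.2, 0.3, 0.4]})
print(resolve_forest_types(y_df, rf_mapper={"var_1": "classification"}))
# {'var_0': 'classification', 'var_1': 'classification'}

# ndarray: keyed by positional index, heuristic only
y_arr = np.arange(40.0).reshape(20, 2)
print(resolve_forest_types(y_arr))
# {0: 'regression', 1: 'regression'}
```

Resolving everything to a plain per-column dict keyed by name or index would let the rest of the fitting code stay agnostic to whether `y` arrived as a dataframe or an array.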
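If we instead leave binning to the user, the yaImpute-style default can be reproduced outside the estimator with standard tools. One sketch using scikit-learn's `KBinsDiscretizer` (the choice of 5 quantile bins is arbitrary, not a recommendation):

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

rng = np.random.default_rng(0)
y = rng.normal(size=(200, 1))  # a continuous y feature

# Bin into 5 ordinal classes by quantile so each class is roughly equally
# populated; the binned labels could then feed a RandomForestClassifier.
binner = KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="quantile")
y_binned = binner.fit_transform(y)

print(sorted(np.unique(y_binned)))  # [0.0, 1.0, 2.0, 3.0, 4.0]
```

`pd.cut`/`pd.qcut` offer a similar route when `y` is a dataframe, which may be a better fit for an example utility function aimed at users who pre-bin their own data.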
@aazuspan, there are lots of unresolved questions at this point, and I would value your help thinking through the knotty data type questions before we start implementing this. I think working with dataframes will be fairly straightforward by setting up the forests based on column names, but type discovery and variable mapping with numpy arrays seems trickier to me.