Dealing with missing values #28

@hneth

It's an important milestone that the latest version allows for missing values.

Nonetheless, some related issues remain:

  1. How are missing values treated when evaluating (individual) cue validities?
    The current version simply ignores missing values. However, this could assign undeservedly high validity to cues with many missing values. A more cautious alternative would classify cases with missing cue values according to the criterion's baseline probabilities. An imaginary cue with ALL values missing would then perform at baseline level, and any improvement beyond this would be due to actual cue information.

  2. How are missing values treated when applying an FFT to data?
    In the current version, nodes with missing data are simply skipped. This seems fine, but what happens when the cue data of a final node is missing? Again, classification according to the baseline probabilities seems an obvious choice (and is what presently happens).
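
To make the first point concrete, here is a rough sketch (in Python, not the package's actual code) that contrasts the two treatments for a single binary cue. The function name `cue_accuracy` and the `treat_missing` argument are hypothetical. "Baseline" is assumed here to mean probability matching: a missing case earns the expected credit of guessing the positive class with the criterion's base rate. Classifying to the majority class would be another reasonable reading.

```python
from typing import Optional, Sequence


def cue_accuracy(cue: Sequence[Optional[bool]],
                 criterion: Sequence[bool],
                 treat_missing: str = "ignore") -> float:
    """Accuracy of a binary cue (None = missing) as a predictor of `criterion`.

    Hypothetical helper for illustration only.
    treat_missing="ignore":   drop cases whose cue value is missing.
    treat_missing="baseline": keep them, crediting the expected accuracy of
                              guessing True with probability `base_rate`
                              (an assumption about what "baseline" means).
    """
    if len(cue) != len(criterion):
        raise ValueError("cue and criterion must have the same length")
    if treat_missing not in ("ignore", "baseline"):
        raise ValueError("treat_missing must be 'ignore' or 'baseline'")

    base_rate = sum(criterion) / len(criterion)
    credit = 0.0
    n_counted = 0
    for value, truth in zip(cue, criterion):
        if value is None:
            if treat_missing == "ignore":
                continue  # case does not enter the statistic at all
            # Expected credit of a base-rate guess for this case
            credit += base_rate if truth else 1.0 - base_rate
        else:
            credit += 1.0 if value == truth else 0.0
        n_counted += 1
    return credit / n_counted if n_counted else float("nan")


criterion = [True, True, False, False]
sparse_cue = [True, None, None, None]   # one lucky hit, three missing values
empty_cue = [None, None, None, None]    # the imaginary all-missing cue

acc_ignore = cue_accuracy(sparse_cue, criterion, "ignore")      # 1.0
acc_baseline = cue_accuracy(sparse_cue, criterion, "baseline")  # 0.625
acc_empty = cue_accuracy(empty_cue, criterion, "baseline")      # 0.5
```

With missing values ignored, the sparse cue looks perfect on the strength of a single case. Under the baseline treatment it lands only slightly above the all-missing cue, which sits exactly at the baseline level.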

Another (and complementary) way to deal with many missing values is to build larger FFTs. As long as the cues at each node perform better than baseline, this should increase the likelihood that a case can be classified at some node.
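
A minimal sketch of the skipping logic and the baseline fallback, again in Python and not taken from the package. The tree representation (a list of `(cue, exit_value)` pairs), the `classify` function, and the deterministic majority-class fallback are all assumptions made for illustration.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical node format: (cue name, exit value). A non-final node exits
# with prediction `exit_value` when the cue equals it; the final node exits
# in both directions.
Node = Tuple[str, bool]


def classify(case: Dict[str, Optional[bool]],
             nodes: List[Node],
             base_rate: float) -> Tuple[bool, str]:
    """Return (prediction, name of the deciding cue or 'baseline')."""
    for i, (cue, exit_value) in enumerate(nodes):
        value = case.get(cue)
        if value is None:
            continue  # missing cue value: skip this node
        is_final = i == len(nodes) - 1
        if value == exit_value or is_final:
            return value, cue
    # No node produced a decision (e.g. the final node's cue is missing):
    # fall back to the majority class implied by the base rate (assumption).
    return base_rate >= 0.5, "baseline"


case = {"a": None, "b": True, "c": None}

small_tree = [("a", True), ("c", False)]
large_tree = [("a", True), ("b", True), ("c", False)]

small_result = classify(case, small_tree, base_rate=0.3)  # (False, "baseline")
large_result = classify(case, large_tree, base_rate=0.3)  # (True, "b")
```

The same case falls through to the baseline in the two-node tree but is decided by cue `b` in the three-node tree, which is the effect described above.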

  3. More advanced issues include:
  • What happens when new variables or levels of variables/factors appear during testing that have not been encountered before (during training)?
  • What about imputing missing values from existing data?
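
One possible direction for both bullets, sketched in Python under loud assumptions: factor levels that never occurred during training are recoded as missing, and missing values are optionally imputed with the most frequent training level. The helpers `fit_levels` and `prepare` are hypothetical, and mode imputation is only the simplest of many imputation schemes.

```python
from collections import Counter
from typing import List, Optional, Set, Tuple


def fit_levels(train_values: List[Optional[str]]) -> Tuple[Set[str], Optional[str]]:
    """Record the levels seen in training and the most frequent one."""
    observed = [v for v in train_values if v is not None]
    mode = Counter(observed).most_common(1)[0][0] if observed else None
    return set(observed), mode


def prepare(test_values: List[Optional[str]],
            levels: Set[str],
            mode: Optional[str],
            impute: bool = False) -> List[Optional[str]]:
    """Recode unseen levels as missing; optionally impute the training mode."""
    prepared = []
    for v in test_values:
        if v is not None and v not in levels:
            v = None  # level never seen in training: treat as missing
        if v is None and impute:
            v = mode  # simple mode imputation (assumption)
        prepared.append(v)
    return prepared


levels, mode = fit_levels(["low", "high", "low", None])
as_missing = prepare(["high", "medium", None], levels, mode)
imputed = prepare(["high", "medium", None], levels, mode, impute=True)
```

Recoding unseen levels as missing lets them reuse whatever rule is settled on for missing values, so no separate mechanism is needed.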
