Dealing with missing values

It's an important milestone that the latest version allows for **missing values**.  

Nonetheless, some related issues remain: 

1. How are missing values treated when **evaluating (individual) cue validities**?  
The current version simply ignores missing values. However, this could assign undeservedly high validity to cues with many missing values, because such cues are evaluated only on the few cases for which they have data. A more cautious alternative would classify cases with missing values according to the criterion's baseline probabilities. An imaginary cue with ALL values missing would then perform at baseline level, and any improvement beyond this would be due to actual cue information.
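
To make the difference concrete, here is a minimal Python sketch (not code from the package). The function `cue_accuracy` and its `missing` argument are hypothetical names, and "classifying by baseline" is assumed here to mean predicting the criterion's majority class for every case with a missing cue value; probability matching would be another reading.

```python
from typing import Optional, Sequence


def cue_accuracy(
    predictions: Sequence[Optional[bool]],
    criterion: Sequence[bool],
    missing: str = "ignore",
) -> float:
    """Accuracy of a single cue's predictions; None marks a missing cue value.

    missing="ignore":   drop cases with a missing cue value.
    missing="baseline": predict the criterion's majority class for those
                        cases (an assumption made for this sketch).
    """
    if len(predictions) != len(criterion):
        raise ValueError("predictions and criterion must have equal length")
    if missing not in ("ignore", "baseline"):
        raise ValueError("missing must be 'ignore' or 'baseline'")

    base_rate = sum(criterion) / len(criterion)
    majority_class = base_rate >= 0.5

    hits = 0
    counted = 0
    for prediction, truth in zip(predictions, criterion):
        if prediction is None:
            if missing == "ignore":
                continue  # the case does not enter the evaluation at all
            prediction = majority_class  # fall back to the baseline class
        hits += prediction == truth
        counted += 1

    return hits / counted if counted else float("nan")


# Ten cases with a 20% base rate; the cue has data for the first case only.
criterion = [True, True] + [False] * 8
sparse_cue = [True] + [None] * 9
empty_cue = [None] * 10

print(cue_accuracy(sparse_cue, criterion, "ignore"))    # 1.0
print(cue_accuracy(sparse_cue, criterion, "baseline"))  # 0.9
print(cue_accuracy(empty_cue, criterion, "baseline"))   # 0.8
```

Under the "ignore" rule the sparse cue looks perfect on the basis of a single case. Under the baseline rule it scores 0.9, only slightly above the 0.8 that an all-missing cue obtains.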

2. How are missing values treated when **applying an FFT** to data?  
In the current version, nodes with missing data are simply skipped. This seems fine, but what happens when the cue data of the final node is missing? Again, classification according to the baseline probabilities seems an obvious choice (and is what presently happens).

Another (and complementary) way to deal with many missing values is to build larger FFTs.  As long as the cues at each node perform better than baseline, this should increase the likelihood that a case can be classified at some node.
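
The skipping rule and the baseline fallback described above can be sketched as follows. This is an illustrative Python sketch, not the package's implementation: the `Node` class, the `classify` function, the `value > threshold` decision rule, and the cue names are all hypothetical, and the baseline is again assumed to be a single fallback class.

```python
from dataclasses import dataclass
from typing import Mapping, Optional, Sequence


@dataclass
class Node:
    cue: str           # name of the cue inspected at this node
    threshold: float   # the node predicts True when value > threshold
    exit: bool         # class on which a non-final node exits


def classify(
    tree: Sequence[Node],
    case: Mapping[str, Optional[float]],
    baseline_class: bool,
) -> bool:
    """Classify one case, skipping nodes whose cue value is missing (None)."""
    last = len(tree) - 1
    for index, node in enumerate(tree):
        value = case.get(node.cue)
        if value is None:
            continue  # missing cue value: skip this node
        prediction = value > node.threshold
        # A non-final node decides only on its exit class;
        # the final node decides in both directions.
        if index == last or prediction == node.exit:
            return prediction
    # No node could decide (e.g. the final cue was missing): use the baseline.
    return baseline_class


tree = [Node("age", 60, True), Node("bp", 140, False), Node("chol", 200, True)]

print(classify(tree, {"age": 70}, baseline_class=False))                           # True
print(classify(tree, {"age": None, "bp": 120, "chol": 250}, baseline_class=True))  # False
print(classify(tree, {"age": 50, "bp": 150, "chol": None}, baseline_class=True))   # True
```

In the third case no earlier node exits and the final cue is missing, so the result comes from the baseline alone. Adding further nodes in front of the fallback would give such a case more chances to be decided by actual cue information.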

3. More **advanced issues** include:

- What happens when _new_ variables or levels of variables/factors appear during testing that have not been encountered before (during training)?
- What about _imputing_ missing values from existing data?



