Hypothesis Tests
Most interesting theories cannot be proven right; they can only be proven wrong. By continuously refuting alternatives, a theory becomes stronger (though most likely without ever reaching the ‘truth’). To perform a hypothesis test is to check whether a theory holds when confronted with alternative theories. In statistical hypothesis testing, this often means checking whether a hypothesis still holds once we account for the possibility that the observed result occurred by pure chance.
<< add an example of the Lady tasting tea (https://en.wikipedia.org/wiki/Lady_tasting_tea) >>
<< describe what contingency tables are, linking their definitions to the ConfusionMatrix and GeneralConfusionMatrix classes >> << enumerate two examples of contingency tables (can be taken from the ConfusionMatrix documentation) >>
This is the second example from Wikipedia's page on hypothesis testing. In this example, a person is tested for clairvoyance (the ability to gain information about something through extrasensory perception, that is, detecting something without using the known human senses).
    // A person is shown the reverse of a playing card 25 times and is
    // asked which of the four suits the card belongs to. Every time
    // the person correctly guesses the suit of the card, we count this
    // result as a correct answer. Let's suppose the person obtained 13
    // correct answers out of the 25 cards.
    // Since each suit appears 1/4 of the time in the card deck, we
    // would assume the probability of producing a correct answer by
    // chance alone to be 1/4.
    // Finally, we are interested in whether the subject performs
    // better than what would be expected by chance. In other words,
    // whether the person's probability of predicting a card is
    // higher than the hypothesized chance value of 1/4.
    BinomialTest test = new BinomialTest(
        successes: 13, trials: 25,
        hypothesizedProbability: 1.0 / 4.0,
        alternate: OneSampleHypothesis.ValueIsGreaterThanHypothesis);
    Console.WriteLine("Test p-Value: " + test.PValue);     // ~ 0.003
    Console.WriteLine("Significant? " + test.Significant); // True.
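As a cross-check on the reported p-value, the upper-tail probability P(X ≥ 13) for a Binomial(25, 1/4) variable can be summed directly. The sketch below is a minimal, framework-independent illustration; the class and method names (`BinomialTailSketch`, `UpperTail`) are hypothetical, and it only assumes that a one-sided binomial test reports this exact tail probability. It is not a description of how `BinomialTest` is implemented.

```csharp
using System;
using System.Globalization;

static class BinomialTailSketch
{
    // P(X >= k) for X ~ Binomial(n, p), with 0 < p < 1.
    // Hypothetical helper written for illustration; not part of the framework.
    public static double UpperTail(int k, int n, double p)
    {
        // Start at P(X = 0) = (1 - p)^n and walk upwards with the recurrence
        // P(X = i + 1) = P(X = i) * (n - i) / (i + 1) * p / (1 - p).
        double pmf = Math.Pow(1 - p, n);
        double tail = 0;
        for (int i = 0; i <= n; i++)
        {
            if (i >= k)
                tail += pmf;
            pmf *= (double)(n - i) / (i + 1) * p / (1 - p);
        }
        return tail;
    }

    static void Main()
    {
        // 13 correct guesses out of 25 with a chance probability of 1/4.
        double p = UpperTail(13, 25, 0.25);
        Console.WriteLine(p.ToString("F4", CultureInfo.InvariantCulture)); // about 0.0034
    }
}
```

The result agrees with the `~ 0.003` shown in the example above, which is well below the usual 0.05 threshold.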
This is a common example with variations given by many sources. Some of them can be found here and here.
    // An insurance company is reviewing its current policy rates. When the
    // company initially set its rates, they believed the average claim amount
    // was about $1,800. Now that the company is already up and running, the
    // executive directors want to know whether the mean is greater than $1,800.
    double hypothesizedMean = 1800;
    // Now we have two hypotheses. The null hypothesis (H0) is that there is no
    // difference between the initial set value of $1,800 and the average claim
    // amount of the population. The alternate hypothesis is that the average
    // is greater than $1,800.
    // H0 : population mean ≤ $1,800
    // H1 : population mean > $1,800
    OneSampleHypothesis alternate = OneSampleHypothesis.ValueIsGreaterThanHypothesis;
    // To verify those hypotheses, we draw 40 random claims and obtain a
    // sample mean of $1,950. The standard deviation of claims is assumed
    // to be around $500.
    double sampleMean = 1950;
    double standardDev = 500;
    int sampleSize = 40;
    // Let's create our test and check the results
    ZTest test = new ZTest(sampleMean, standardDev,
        sampleSize, hypothesizedMean, alternate);
    Console.WriteLine("Test p-Value: " + test.PValue);      // ~0.03
    Console.WriteLine("Significant? " + test.Significant); // True.
    // In case we would like more information about what was calculated:
    Console.WriteLine("z statistic: " + test.Statistic);     // ~1.89736
    Console.WriteLine("std. error: " + test.StandardError); // 79.05694
    Console.WriteLine("test tail: " + test.Tail); // one Upper (right)
    Console.WriteLine("alpha level: " + test.Size); // 0.05
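The standard error and z statistic printed above follow from two short formulas: the standard error is the assumed standard deviation divided by the square root of the sample size, and z is the difference between the sample mean and the hypothesized mean divided by that standard error. The sketch below reproduces both numbers without the framework; the names `ZStatisticSketch`, `StandardError` and `Statistic` are hypothetical helpers introduced only for illustration.

```csharp
using System;

static class ZStatisticSketch
{
    // Standard error of the mean when the standard deviation is assumed known.
    public static double StandardError(double standardDev, int sampleSize)
    {
        return standardDev / Math.Sqrt(sampleSize);
    }

    // z = (sample mean - hypothesized mean) / standard error.
    public static double Statistic(double sampleMean, double hypothesizedMean,
        double standardDev, int sampleSize)
    {
        return (sampleMean - hypothesizedMean) / StandardError(standardDev, sampleSize);
    }

    static void Main()
    {
        Console.WriteLine(StandardError(500, 40));         // about 79.0569
        Console.WriteLine(Statistic(1950, 1800, 500, 40)); // about 1.8974
    }
}
```

For the one-sided alternative used here, the p-value is the area of the standard normal distribution to the right of z, which for z ≈ 1.897 is about 0.029, in line with the `~0.03` reported above.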
This example comes from Wikipedia's page on the F-test. Suppose we would like to study the effect of three different levels of a factor on a response (for example, three levels of a fertilizer on plant growth). We have made 6 observations for each of the three levels a1, a2 and a3, and have written the results as in the table below.
    double[][] outcomes = new double[,]
    {
        // a1 a2 a3
        {  6,    8,  13 },
        {  8,   12,   9 },
        {  4,    9,  11 },
        {  5,   11,   8 },
        {  3,    6,   7 },
        {  4,    8,  12 },
    }
    .Transpose().ToArray();
    // Now we can create an ANOVA for the outcomes
    OneWayAnova anova = new OneWayAnova(outcomes);
    // Retrieve the F-test
    FTest test = anova.FTest;
    Console.WriteLine("Test p-value: " + test.PValue);   // ~0.002
    Console.WriteLine("Significant? " + test.Significant); // true
    // Show the ANOVA table
    DataGridBox.Show(anova.Table);
The last line in the example shows the ANOVA table using the framework's DataGridBox object. The DataGridBox is a convenience class for displaying DataGridViews just as one would display a message using MessageBox. The table is shown below: 
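The F statistic behind this test can also be reproduced by hand: it is the ratio of the between-group mean square to the within-group mean square. The sketch below is a minimal, framework-independent illustration using the same 18 observations, already arranged one array per level; `OneWayAnovaSketch` and `FStatistic` are hypothetical names, not part of the framework.

```csharp
using System;
using System.Linq;

static class OneWayAnovaSketch
{
    // One-way ANOVA F statistic: (SS_between / (k - 1)) / (SS_within / (n - k)),
    // where k is the number of groups and n the total number of observations.
    public static double FStatistic(double[][] groups)
    {
        int k = groups.Length;
        int n = groups.Sum(g => g.Length);
        double grandMean = groups.SelectMany(g => g).Average();

        double between = 0, within = 0;
        foreach (double[] g in groups)
        {
            double mean = g.Average();
            between += g.Length * (mean - grandMean) * (mean - grandMean);
            within += g.Sum(v => (v - mean) * (v - mean));
        }

        return (between / (k - 1)) / (within / (n - k));
    }

    static void Main()
    {
        double[][] groups =
        {
            new double[] {  6,  8,  4,  5, 3,  4 }, // a1, mean 5
            new double[] {  8, 12,  9, 11, 6,  8 }, // a2, mean 9
            new double[] { 13,  9, 11,  8, 7, 12 }, // a3, mean 10
        };

        Console.WriteLine(FStatistic(groups)); // about 9.26
    }
}
```

Here the between-group sum of squares is 84 on 2 degrees of freedom and the within-group sum of squares is 68 on 15 degrees of freedom, giving F = 42 / 4.53 ≈ 9.26.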
This example comes from the stats page of the College of Saint Benedict and Saint John's University (Kirkman, 1996). It is an interesting example because it shows a case in which a t-test fails to detect a difference between two samples because their distributions are not normal. The nonparametric Kolmogorov-Smirnov test, on the other hand, succeeds. The example deals with the preference of bees between two nearby blooming trees in an empty field. The experimenter has collected data measuring how much time a bee spends near a particular tree. Timing starts when a bee first touches the tree and stops when the bee moves more than 1 meter away from it. The samples below give the measured time, in seconds, of the observed bees for each of the trees.
    double[] redwell = 
    {
        23.4, 30.9, 18.8, 23.0, 21.4, 1, 24.6, 23.8, 24.1, 18.7, 16.3, 20.3,
        14.9, 35.4, 21.6, 21.2, 21.0, 15.0, 15.6, 24.0, 34.6, 40.9, 30.7, 
        24.5, 16.6, 1, 21.7, 1, 23.6, 1, 25.7, 19.3, 46.9, 23.3, 21.8, 33.3, 
        24.9, 24.4, 1, 19.8, 17.2, 21.5, 25.5, 23.3, 18.6, 22.0, 29.8, 33.3,
        1, 21.3, 18.6, 26.8, 19.4, 21.1, 21.2, 20.5, 19.8, 26.3, 39.3, 21.4, 
        22.6, 1, 35.3, 7.0, 19.3, 21.3, 10.1, 20.2, 1, 36.2, 16.7, 21.1, 39.1,
        19.9, 32.1, 23.1, 21.8, 30.4, 19.62, 15.5 
    };
    double[] whitney = 
    {
        16.5, 1, 22.6, 25.3, 23.7, 1, 23.3, 23.9, 16.2, 23.0, 21.6, 10.8, 12.2,
        23.6, 10.1, 24.4, 16.4, 11.7, 17.7, 34.3, 24.3, 18.7, 27.5, 25.8, 22.5,
        14.2, 21.7, 1, 31.2, 13.8, 29.7, 23.1, 26.1, 25.1, 23.4, 21.7, 24.4, 13.2,
        22.1, 26.7, 22.7, 1, 18.2, 28.7, 29.1, 27.4, 22.3, 13.2, 22.5, 25.0, 1,
        6.6, 23.7, 23.5, 17.3, 24.6, 27.8, 29.7, 25.3, 19.9, 18.2, 26.2, 20.4,
        23.3, 26.7, 26.0, 1, 25.1, 33.1, 35.0, 25.3, 23.6, 23.2, 20.2, 24.7, 22.6,
        39.1, 26.5, 22.7
    };
    // Create a t-test as a first attempt.
    var t = new TwoSampleTTest(redwell, whitney);
    Console.WriteLine("T-Test");
    Console.WriteLine("Test p-value: " + t.PValue);    // ~0.837
    Console.WriteLine("Significant? " + t.Significant); // false
    // Create a non-parametric Kolmogorov-Smirnov test
    var ks = new TwoSampleKolmogorovSmirnovTest(redwell, whitney);
    Console.WriteLine("KS-Test");
    Console.WriteLine("Test p-value: " + ks.PValue);    // ~0.038
    Console.WriteLine("Test p-value: " + ks.PValue);    // ~0.038
    Console.WriteLine("Significant? " + ks.Significant); // true
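The Kolmogorov-Smirnov test succeeds where the t-test fails because it compares the whole empirical distribution of each sample rather than only their means. Its statistic D is the largest vertical gap between the two empirical cumulative distribution functions. The sketch below illustrates that computation on a small made-up data set (the four-element arrays are hypothetical and chosen only so the result is easy to verify by hand); `KolmogorovSmirnovSketch` and `Statistic` are likewise hypothetical names, and the p-value computation is not covered.

```csharp
using System;

static class KolmogorovSmirnovSketch
{
    // Two-sample KS statistic: the maximum absolute difference between
    // the empirical distribution functions of x and y.
    public static double Statistic(double[] x, double[] y)
    {
        double[] a = (double[])x.Clone();
        double[] b = (double[])y.Clone();
        Array.Sort(a);
        Array.Sort(b);

        int i = 0, j = 0;
        double d = 0;
        while (i < a.Length && j < b.Length)
        {
            // Step past every observation equal to the current smallest value
            // in both samples, so ties are handled before comparing the ECDFs.
            double v = Math.Min(a[i], b[j]);
            while (i < a.Length && a[i] <= v) i++;
            while (j < b.Length && b[j] <= v) j++;

            double gap = Math.Abs((double)i / a.Length - (double)j / b.Length);
            d = Math.Max(d, gap);
        }
        return d;
    }

    static void Main()
    {
        // Hypothetical toy samples, not the bee data.
        double[] first = { 1, 2, 3, 4 };
        double[] second = { 3, 4, 5, 6 };
        Console.WriteLine(Statistic(first, second)); // 0.5
    }
}
```

In the toy data, half of the first sample lies at or below 2 while none of the second does, so the largest gap is 0.5. The same function could be applied to the `redwell` and `whitney` arrays to obtain the statistic for the bee data.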
The last example comes from (E. Ientilucci, 2006) and deals with comparing the performance of two different raters (classifiers) to see whether their performance differs significantly. Suppose an experimenter has two classification systems, both trained to classify observations into one of 4 mutually exclusive categories. To measure the performance of each classifier, the experimenter compared their classification labels against the ground truth for a testing dataset and recorded the results as contingency tables. The hypothesis to be tested is that the performance of the two classifiers is the same.
    // Create the confusion matrix for the first system.
    var a = new GeneralConfusionMatrix(new int[,]
    {
        { 317,  23,  0,  0 },
        {  61, 120,  0,  0 },
        {   2,   4, 60,  0 },
        {  35,  29,  0,  8 },
    });
    // Create the confusion matrix for the second system.
    var b = new GeneralConfusionMatrix(new int[,]
    {
        { 377,  79,  0,  0 },
        {   2,  72,  0,  0 },
        {  33,   5, 60,  0 },
        {   3,  20,  0,  8 },
    });
    var test = new TwoMatrixKappaTest(a, b);
    Console.WriteLine("Test p-value: " + test.PValue);    // ~0.628
    Console.WriteLine("Significant? " + test.Significant); // false
In this case, the test didn't show enough evidence to confidently reject the null hypothesis. Therefore, one should refrain from affirming anything about differences between the two systems unless the power of the test is known. Unfortunately, I could not find a clear indication in the literature about the power of a two-matrix Kappa test. However, since the test statistic is asymptotically normal, one could try to estimate the power of this test by analyzing the power of the underlying Z-test. If there is enough power, one could possibly accept the null hypothesis that there are no large differences between the two systems.
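To see what the test is comparing, it helps to compute Cohen's kappa for each confusion matrix: kappa = (p_o − p_e) / (1 − p_e), where p_o is the observed agreement (the diagonal proportion) and p_e is the agreement expected by chance from the row and column totals. The sketch below is a minimal, framework-independent illustration; `KappaSketch` and `Kappa` are hypothetical names, and the variance estimates that the two-matrix test also needs are not covered.

```csharp
using System;

static class KappaSketch
{
    // Cohen's kappa for a square contingency table of counts.
    public static double Kappa(int[,] m)
    {
        int size = m.GetLength(0);
        double total = 0, diagonal = 0;
        double[] rowSums = new double[size];
        double[] colSums = new double[size];

        for (int i = 0; i < size; i++)
        {
            for (int j = 0; j < size; j++)
            {
                rowSums[i] += m[i, j];
                colSums[j] += m[i, j];
                total += m[i, j];
                if (i == j)
                    diagonal += m[i, j];
            }
        }

        double observed = diagonal / total; // p_o
        double expected = 0;                // p_e
        for (int i = 0; i < size; i++)
            expected += rowSums[i] * colSums[i] / (total * total);

        return (observed - expected) / (1 - expected);
    }

    static void Main()
    {
        int[,] a =
        {
            { 317,  23,  0, 0 },
            {  61, 120,  0, 0 },
            {   2,   4, 60, 0 },
            {  35,  29,  0, 8 },
        };
        int[,] b =
        {
            { 377, 79,  0, 0 },
            {   2, 72,  0, 0 },
            {  33,  5, 60, 0 },
            {   3, 20,  0, 8 },
        };

        Console.WriteLine(Kappa(a)); // about 0.605
        Console.WriteLine(Kappa(b)); // about 0.586
    }
}
```

The two kappa values differ by only about 0.02, which is consistent with the test not finding a significant difference between the systems.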
As always, I hope the above discussion and examples are useful to interested readers and users. If you believe you have found a flaw, or would like to discuss any portion of this post, please feel free to do so by posting on the issues tracker.
