From 3cb8622416af0d1f1b5f00cf0cbf6735f258644a Mon Sep 17 00:00:00 2001
From: Giorgos Stamatelatos
Date: Tue, 3 Jul 2018 16:37:31 +0300
Subject: [PATCH] Enrich package-info.java

---
 .../java/gr/james/sampling/package-info.java | 102 +++++++++++++++---
 1 file changed, 89 insertions(+), 13 deletions(-)

diff --git a/src/main/java/gr/james/sampling/package-info.java b/src/main/java/gr/james/sampling/package-info.java
index 312b6cc..3472c84 100644
--- a/src/main/java/gr/james/sampling/package-info.java
+++ b/src/main/java/gr/james/sampling/package-info.java
@@ -1,20 +1,96 @@
 /**
  * The package containing the utilities for random sampling.
  *

+ * Reservoir sampling is a family of randomized algorithms for choosing a sample of {@code k} items from a list
+ * {@code S} containing {@code n} items, where {@code n} is either very large or unknown. Typically {@code n}
+ * is large enough that the list doesn't fit into main memory. In this context, the sample of {@code k} items will be
+ * referred to as the sample and the list {@code S} as the stream.
+ *

+ * This package divides these algorithms into two main categories: those that assign a weight to each item of
+ * the source stream and those that don't. These will be referred to as weighted and unweighted random sampling
+ * algorithms respectively. In unweighted algorithms, each item in the stream has probability {@code k/n} of appearing
+ * in the sample. In weighted algorithms, this probability depends on an extra weight parameter. Each algorithm may
+ * interpret this parameter in a different way; for example, Weighted Random Sampling over Data Streams mentions two
+ * possible interpretations.
+ *

  * The top level interfaces are {@link gr.james.sampling.RandomSampling} and
  * {@link gr.james.sampling.WeightedRandomSampling}, which represent unweighted and weighted random sampling algorithms
- * respectively.
- *

RandomSampling implementations

- * - *

WeightedRandomSampling implementations

- *
+ * respectively. The {@code WeightedRandomSampling} interface extends {@code RandomSampling}, so weighted
+ * algorithms can be used in place of unweighted ones, usually with a performance penalty.
+ *

Properties

+ *

Complexity

+ * A fundamental principle of reservoir-based sampling algorithms is that the memory complexity is linear with respect
+ * to the reservoir size. Furthermore, the sampling process is performed in a single pass over the stream. The number
+ * of RNG invocations varies among the different implementations.
+ *

Precision

+ * Many implementations keep an accumulating state, for example a variable that strictly increases or decreases as
+ * elements are read from the stream. Because the implementations use finite-precision data types (usually
+ * {@code double} or {@code long}), this state causes the precision of these implementations to degrade as the
+ * stream size increases.
+ *

Overflow

+ * Related to the concept of precision, overflow refers to the situation where the precision has degraded into a
+ * non-recurrent state that would prevent the algorithm from behaving consistently. In such cases, the implementation
+ * throws {@link gr.james.sampling.StreamOverflowException} to indicate this state.
+ *

Duplicates

+ * A {@code RandomSampling} algorithm does not keep track of duplicate elements because that would require memory
+ * linear in the stream size. Thus, it is valid to feed the same element multiple times to the same instance. For
+ * example, it is possible to feed both {@code x} and {@code y}, where {@code x.equals(y)}. The algorithm will treat
+ * these items as distinct, even if they are reference-equal ({@code x == y}). As a result, the final sample
+ * {@link java.util.Collection} may contain duplicate elements. Furthermore, elements need not be immutable and the
+ * sampling process does not rely on the elements' {@code hashCode()} and {@code equals()} methods.
+ *

Implementations

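+ * <table>
+ * <tr>
+ * <th>Implementation</th>
+ * <th>Algorithm</th>
+ * <th>Space</th>
+ * <th>Precision</th>
+ * <th>Weighted</th>
+ * </tr>
+ * <tr>
+ * <td>{@link gr.james.sampling.WatermanSampling}</td>
+ * <td>Algorithm R by Waterman</td>
+ * <td>{@code O(k)}</td>
+ * <td>D</td>
+ * <td>NO</td>
+ * </tr>
+ * <tr>
+ * <td>{@link gr.james.sampling.VitterXSampling}</td>
+ * <td>Algorithm X by Vitter</td>
+ * <td>{@code O(k)}</td>
+ * <td>D</td>
+ * <td>NO</td>
+ * </tr>
+ * <tr>
+ * <td>{@link gr.james.sampling.VitterZSampling}</td>
+ * <td>Algorithm Z by Vitter</td>
+ * <td>{@code O(k)}</td>
+ * <td>D</td>
+ * <td>NO</td>
+ * </tr>
+ * <tr>
+ * <td>{@link gr.james.sampling.LiLSampling}</td>
+ * <td>Algorithm L by Li</td>
+ * <td>{@code O(k)}</td>
+ * <td>D</td>
+ * <td>NO</td>
+ * </tr>
+ * <tr>
+ * <td>{@link gr.james.sampling.ChaoSampling}</td>
+ * <td>Algorithm by Chao</td>
+ * <td>{@code O(k)}</td>
+ * <td>D</td>
+ * <td>YES</td>
+ * </tr>
+ * <tr>
+ * <td>{@link gr.james.sampling.EfraimidisSampling}</td>
+ * <td>Algorithm A-Res by Efraimidis</td>
+ * <td>{@code O(k)}</td>
+ * <td>ND</td>
+ * <td>YES</td>
+ * </tr>
+ * </table>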
+ *
+ * @see Weighted Random Sampling over Data Streams
  */
 package gr.james.sampling;
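
Note on the Complexity, Precision, Overflow and Duplicates paragraphs added by this patch: the properties they describe can be seen in a minimal sketch of Algorithm R, the algorithm the table attributes to Waterman. This is not the code of `WatermanSampling`, whose API is not shown in the patch. The class `ReservoirSketch`, the method `sample`, and the use of `IllegalStateException` as a stand-in for `StreamOverflowException` are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Random;

/** Illustrative sketch only; not one of the classes in gr.james.sampling. */
final class ReservoirSketch {
    private ReservoirSketch() {
    }

    /** Draws a uniform sample of at most k items in a single pass, using O(k) memory. */
    static <T> List<T> sample(Iterator<T> stream, int k, Random rng) {
        if (k < 1) {
            throw new IllegalArgumentException("k must be positive");
        }
        List<T> reservoir = new ArrayList<>(k);
        int seen = 0; // accumulating state: strictly increases as items are read
        while (stream.hasNext()) {
            T item = stream.next();
            if (seen == Integer.MAX_VALUE) {
                // Hypothetical stand-in for the package's StreamOverflowException
                throw new IllegalStateException("stream too long for an int counter");
            }
            seen++;
            if (reservoir.size() < k) {
                reservoir.add(item); // the first k items always enter the reservoir
            } else {
                int j = rng.nextInt(seen); // uniform in [0, seen)
                if (j < k) {
                    reservoir.set(j, item); // replaces a slot with probability k / seen
                }
            }
        }
        return reservoir;
    }
}
```

A call such as `ReservoirSketch.sample(items.iterator(), 10, new Random())` returns at most 10 items. The sketch never calls `equals()` or `hashCode()` on the items, so feeding the same element twice simply gives it two chances to be selected, and the result may contain duplicates.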