Skip to content

Commit

Permalink
package-info.java citations
Browse files Browse the repository at this point in the history
  • Loading branch information
gstamatelat committed Jul 3, 2018
1 parent 7b2f7c7 commit 7ba81ec
Show file tree
Hide file tree
Showing 2 changed files with 67 additions and 26 deletions.
21 changes: 21 additions & 0 deletions src/javadoc/stylesheet.css
Original file line number Diff line number Diff line change
Expand Up @@ -29,3 +29,24 @@ table.table tbody tr:nth-child(2n) {
table.table th {
text-align: left;
}

/* Lists */

ol.citations {
margin: 0;
padding: 0;
list-style-type: none;
}

ol.citations li {
counter-increment: citations-counter;
}

ol.citations li:before {
content: "[" counter(citations-counter) "]";
margin-right: 0.5em;
}

ol.citations li {
margin-bottom: 0.2em;
}
72 changes: 46 additions & 26 deletions src/main/java/gr/james/sampling/package-info.java
Original file line number Diff line number Diff line change
Expand Up @@ -7,22 +7,35 @@
* referred to as sample and the list {@code S} as stream.
* <p>
* This package distinguishes these algorithms into two main categories: the ones that assign a weight in each item of
* the source stream and the ones that don't. These will be referred to as weighted and unweighted random sampling
* algorithms respectively. In unweighted algorithms, each item in the stream has probability {@code k/n} in appearing
* in the sample. In weighted algorithms this probability depends on the extra parameter weight. Each algorithm may
* interpret this parameter in a different way, for example in <b>Weighted Random Sampling over Data Streams</b> two
* possible interpretations are mentioned.
* <p>
* The top level interfaces are {@link gr.james.sampling.RandomSampling} and
* {@link gr.james.sampling.WeightedRandomSampling}, which represent unweighted and weighted random sampling algorithms
* respectively. The {@code WeightedRandomSampling} interface extends {@code RandomSampling} and, thus, weighted
* algorithms can be used in-place as unweighted, usually with a performance penalty due to the extra weight-related
* overhead.
* the source stream (interface {@link gr.james.sampling.WeightedRandomSampling}) and the ones that don't
* (interface {@link gr.james.sampling.RandomSampling}). In unweighted algorithms, each item in the stream has
* probability {@code k/n} in appearing in the sample. As a result, they have equivalent behavior and are differentiated
* only on their performance characteristics. In weighted algorithms this probability depends on the extra
* {@code weight} parameter (see <em><a href="#weights">weights</a></em> for more details). The
* {@code WeightedRandomSampling} interface extends {@code RandomSampling} and, thus, weighted algorithms can be used
* in-place as unweighted, usually with a performance penalty due to the extra weight-related overhead.
* <h3>Properties</h3>
* <h4>Complexity</h4>
* A fundamental principle of reservoir based sampling algorithms is that the memory complexity is linear in respect to
* the reservoir size {@code O(k)}. Furthermore, the sampling process is performed using a single pass of the stream.
* The amount of RNG invocations vary among the different implementations.
* <h4>Duplicates</h4>
* A {@code RandomSampling} algorithm does not keep track of duplicate elements because that would result in a linear
* memory complexity. Thus, it is valid to feed the same element multiple times in the same instance. For example it is
* possible to feed both {@code x} and {@code y}, where {@code x.equals(y)}. The algorithm will treat these items as
* distinct, even if they are reference-equals ({@code x == y}). As a result, the final sample
* {@link java.util.Collection} may contain duplicate elements. Furthermore, elements need not be immutable and the
* sampling process does not rely on the elements' {@code hashCode()} and {@code equals()} methods.
* <h4 id="weights">Weights</h4>
* The interpretation of the weight may be different for each {@code WeightedRandomSampling} implementation. For
* example, in [1] two possible interpretations are mentioned. In the first case, the probability of an item to be in
* the final sample is proportional to its relative weight (implemented in {@code ChaoSampling}). In the second, the
* relative weight determines the probability that the item is selected in each of the explicit or implicit item
* selections of the sampling procedure (implemented in {@code EfraimidisSampling}). As a result, implementations of
* this interface may not exhibit identical behavior, as opposed to the {@link gr.james.sampling.RandomSampling}
* interface. The contract of this interface is, however, that a higher weight value suggests a higher probability for
* an item to be included in the sample. Implementations may also define certain restrictions on the values of
* {@code weight} and violations will result in {@link gr.james.sampling.IllegalWeightException}.
* <h4>Precision</h4>
* Many implementations have an accumulating state which causes the precision of the algorithms to degrade as the stream
* becomes bigger. An example might be a variable state which strictly increases or decreases as elements are read from
Expand All @@ -32,13 +45,6 @@
* Related to the concept of precision, overflow refers to the situation where the precision has degraded into a
* non-recurrent state that would prevent the algorithm from behaving consistently. In these cases the implementation
* will throw {@link gr.james.sampling.StreamOverflowException} to indicate this state.
* <h4>Duplicates</h4>
* A {@code RandomSampling} algorithm does not keep track of duplicate elements because that would result in a linear
* memory complexity. Thus, it is valid to feed the same element multiple times in the same instance. For example it is
* possible to feed both {@code x} and {@code y}, where {@code x.equals(y)}. The algorithm will treat these items as
* distinct, even if they are reference-equals ({@code x == y}). As a result, the final sample
* {@link java.util.Collection} may contain duplicate elements. Furthermore, elements need not be immutable and the
* sampling process does not rely on the elements' {@code hashCode()} and {@code equals()} methods.
* <h3>Implementations</h3>
* <table class="table" summary="">
* <thead>
Expand All @@ -53,49 +59,63 @@
* <tbody>
* <tr>
* <td>{@link gr.james.sampling.WatermanSampling}</td>
* <td>Algorithm R by Waterman</td>
* <td>Algorithm R by Waterman [2]</td>
* <td>{@code O(k)}</td>
* <td>D</td>
* <td>NO</td>
* </tr>
* <tr>
* <td>{@link gr.james.sampling.VitterXSampling}</td>
* <td>Algorithm X by Vitter</td>
* <td>Algorithm X by Vitter [3]</td>
* <td>{@code O(k)}</td>
* <td>D</td>
* <td>NO</td>
* </tr>
* <tr>
* <td>{@link gr.james.sampling.VitterZSampling}</td>
* <td>Algorithm Z by Vitter</td>
* <td>Algorithm Z by Vitter [3]</td>
* <td>{@code O(k)}</td>
* <td>D</td>
* <td>NO</td>
* </tr>
* <tr>
* <td>{@link gr.james.sampling.LiLSampling}</td>
* <td>Algorithm L by Li</td>
* <td>Algorithm L by Li [4]</td>
* <td>{@code O(k)}</td>
* <td>D</td>
* <td>NO</td>
* </tr>
* <tr>
* <td>{@link gr.james.sampling.ChaoSampling}</td>
* <td>Algorithm by Chao</td>
* <td>Algorithm by Chao [5][6]</td>
* <td>{@code O(k)}</td>
* <td>D</td>
* <td>YES</td>
* </tr>
* <tr>
* <td>{@link gr.james.sampling.EfraimidisSampling}</td>
* <td>Algorithm A-Res by Efraimidis</td>
* <td>Algorithm A-Res by Efraimidis [7]</td>
* <td>{@code O(k)}</td>
* <td>ND</td>
* <td>YES</td>
* </tr>
* </tbody>
* </table>
*
* @see <a href="https://doi.org/10.1007/978-3-319-24024-4_12">Weighted Random Sampling over Data Streams</a>
* <h3>References</h3>
* <ol class="citations">
* <li><a href="https://doi.org/10.1007/978-3-319-24024-4_12">Efraimidis, Pavlos S. "Weighted random sampling over data
* streams." Algorithms, Probability, Networks, and Games. Springer International Publishing, 2015. 183-195.</a></li>
* <li>The Art of Computer Programming, Vol II, Random Sampling and Shuffling</li>
* <li><a href="https://doi.org/10.1145/3147.3165">Vitter, Jeffrey S. "Random sampling with a reservoir."
* ACM Transactions on Mathematical Software (TOMS) 11.1 (1985): 37-57.</a></li>
* <li><a href="https://doi.org/10.1145/198429.198435">Li, Kim-Hung. "Reservoir-sampling algorithms of time complexity
* O(n(1+log(N/n)))." ACM Transactions on Mathematical Software (TOMS) 20.4 (1994): 481-493.</a></li>
* <li><a href="https://doi.org/10.2307/2336002">Chao, M. T. "A general purpose unequal probability sampling plan."
* Biometrika 69.3 (1982): 653-656.</a></li>
* <li><a href="https://doi.org/10.1080/02664769624152">Sugden, R. A. "Chao's list sequential scheme for unequal
* probability sampling." Journal of Applied Statistics 23.4 (1996): 413-421.</a></li>
* <li><a href="https://doi.org/10.1016/j.ipl.2005.11.003">Efraimidis, Pavlos S., and Paul G. Spirakis. "Weighted random
* sampling with a reservoir." Information Processing Letters 97.5 (2006): 181-185.</a></li>
* </ol>
*/
package gr.james.sampling;

0 comments on commit 7ba81ec

Please sign in to comment.