Judging how similar one object is to another is a common task in daily life. Recognizing the similarities between different objects can give useful insight into the problem at hand.
The min-Hash algorithm is a time- and space-efficient method for estimating the similarity between sets. Broder introduced the use of min-Hash for calculating the resemblance and containment of documents in 1997 [1]. In 2000 [2], he focused on one specific part of it, the min-wise independent permutations, which were essential to the algorithm the AltaVista web index used to find similar web pages in a huge collection of web pages. An index such as AltaVista has no use for redundant copies among so many documents, so an efficient algorithm was required to find duplicate and near-duplicate documents.
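As a rough illustration of the idea, the sketch below estimates the Jaccard resemblance of two sets from short min-Hash signatures. It is not the code in `main.py`: the function names (`make_hash_params`, `minhash_signature`, `estimate_resemblance`), the choice of SHA-1 to turn items into integers, and the linear hash family `(a*x + b) mod p` used as a stand-in for min-wise independent permutations are all assumptions made for this example.

```python
import hashlib
import random

# Modulus for the hash family: the Mersenne prime 2**61 - 1 (an arbitrary choice).
PRIME = (1 << 61) - 1


def make_hash_params(num_hashes, seed=0):
    """Draw (a, b) pairs for hash functions h(x) = (a*x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME))
            for _ in range(num_hashes)]


def item_to_int(item):
    """Map a string to a stable 64-bit integer (independent of PYTHONHASHSEED)."""
    digest = hashlib.sha1(item.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(items, params):
    """Return, for each hash function, the minimum hash value over the set."""
    values = [item_to_int(item) for item in items]
    if not values:
        raise ValueError("cannot build a signature for an empty set")
    return [min((a * x + b) % PRIME for x in values) for a, b in params]


def estimate_resemblance(sig_a, sig_b):
    """Fraction of signature positions on which the two sets agree."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


def jaccard(set_a, set_b):
    """Exact resemblance |A & B| / |A | B|, for comparison."""
    return len(set_a & set_b) / len(set_a | set_b)


if __name__ == "__main__":
    doc_a = set("the quick brown fox jumps over the lazy dog".split())
    doc_b = set("the quick brown fox leaps over a lazy dog".split())
    params = make_hash_params(num_hashes=256, seed=42)
    sig_a = minhash_signature(doc_a, params)
    sig_b = minhash_signature(doc_b, params)
    print("exact resemblance:    ", jaccard(doc_a, doc_b))
    print("estimated resemblance:", estimate_resemblance(sig_a, sig_b))
```

Each signature has a fixed length no matter how large the underlying set is, which is where the space saving comes from. Using more hash functions narrows the gap between the estimate and the exact value, at the cost of longer signatures.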
Executing the example:
python main.py
Omri Lahav
- E-mail: [email protected]
- LinkedIn: https://www.linkedin.com/in/omri-lahav-a89b1957
Copyright (C) 2017 Omri Lahav ([email protected])
All rights reserved.
This software can be used free of charge. Please cite and reference.
- [1] A. Z. Broder, “On the Resemblance and Containment of Documents,” Proc. Compression Complex. Seq. 1997, pp. 21–29, 1997.
- [2] A. Z. Broder, “Min-wise independent permutations: Theory and practice,” Autom. Lang. Program., vol. 1853, p. 808, 2000.