omrilahav/MinHash

Min-Hash Signatures Generator

Judging how similar one object is to another is a common task in everyday life. Recognizing the similarities between different objects can give us valuable insight into the problems we are dealing with.

The min-Hash algorithm is a time- and space-efficient algorithm for estimating the similarity between sets. Broder introduced the use of min-Hash for calculating the resemblance and containment of documents in 1997 [1]. In 2000 [2], Broder focused on one specific part of it, min-wise permutations, which were essential to the algorithm the AltaVista web index used to find similar web pages in a huge collection of web pages. An index such as AltaVista has no need to store duplicate copies of so many documents, so an efficient algorithm was required to find duplicate or near-duplicate documents.
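To make the idea concrete, here is a minimal sketch of how min-Hash signatures can be built and compared. It is an illustration only and is not necessarily how `main.py` in this repository works: the function names, the linear hash family `(a*x + b) mod p`, and the choice of prime are all assumptions made for this example, and the set elements are assumed to be non-negative integers (for instance, shingle IDs).

```python
import random

# Assumption for this sketch: a Mersenne prime larger than any element we hash.
_PRIME = (1 << 61) - 1


def make_hash_params(num_hashes, seed=0):
    """Draw (a, b) pairs for hash functions h(x) = (a*x + b) mod _PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, _PRIME), rng.randrange(0, _PRIME))
            for _ in range(num_hashes)]


def minhash_signature(items, params):
    """Signature = the minimum hash value of the set under each hash function.

    `items` must be a non-empty collection of integers in [0, _PRIME).
    """
    return [min((a * x + b) % _PRIME for x in items) for a, b in params]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree (approximates Jaccard similarity)."""
    matches = sum(1 for u, v in zip(sig_a, sig_b) if u == v)
    return matches / len(sig_a)


def exact_jaccard(a, b):
    """Exact Jaccard similarity |A & B| / |A | B|, for comparison."""
    return len(a & b) / len(a | b)


# Usage: two sets that share 50 of their 150 distinct elements.
a = set(range(0, 100))
b = set(range(50, 150))
params = make_hash_params(256, seed=42)
estimate = estimate_jaccard(minhash_signature(a, params),
                            minhash_signature(b, params))
print("exact:", exact_jaccard(a, b), "estimated:", estimate)
```

The signatures have a fixed length (256 values here) regardless of how large the sets are, which is where the space saving comes from. Using more hash functions generally tightens the estimate at the cost of longer signatures.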

Getting Started

Executing the example:

python main.py

Authors

Omri Lahav

License

Copyright (C) 2017 Omri Lahav ([email protected])

All rights reserved.

This software can be used free of charge. Please cite and reference.

References

  • [1] A. Z. Broder, “On the Resemblance and Containment of Documents,” Proc. Compression Complex. Seq. 1997, pp. 21–29, 1997.
  • [2] A. Z. Broder, “Min-wise independent permutations: Theory and practice,” Autom. Lang. Program., vol. 1853, p. 808, 2000.
