Jaccard Similarity
The Conceptual Process
- Shingling: Convert documents into sets of short strings (e.g., 3-word phrases).
- Permutation: Imagine taking all possible shingles and randomly shuffling their order.
- The "Min" Hash: The MinHash value for a document is the index of the first shingle that appears in that document after the shuffle.
The probability that two sets have the same MinHash value is exactly equal to their Jaccard Similarity:
Benefit
- Efficiency: we can compare two 1MB documents by just comparing 100 integers (their MinHash signature).
- Scalability: It is often paired with Locality Sensitive Hashing (LSH) to find similar items in sub-linear time, meaning we don't have to check every single pair in a database.
- Storage: only need to store the compact signatures, not the full text of the documents.
Real-World Applications
- Search Engines: Detecting "mirror" websites or slightly modified versions of the same article.
- Genomics: Comparing DNA sequences to find similar species or genes.
- Plagiarism Detection: Finding overlapping blocks of text across a massive database of student essays.
#include <iostream>
#include <vector>
#include <set>
#include <algorithm>
#include <limits>
using namespace std;
// A simple hash function to simulate different permutations
// h(x) = (a * x + b) % c
uint32_t hash_func(uint32_t x, uint32_t a, uint32_t b, uint32_t c) {
return (a * x + b) % c;
}
int main() {
// 1. Represent two documents as sets of "shingles" (hashed strings)
set<uint32_t> doc1 = {10, 25, 30, 45, 50};
set<uint32_t> doc2 = {10, 25, 31, 46, 50}; // Mostly similar
// 2. Define parameters for our MinHash signatures
int num_hashes = 100;
uint32_t large_prime = 1000003;
// Coefficients for our hash functions (randomly chosen for this example)
vector<uint32_t> a_coeffs(num_hashes), b_coeffs(num_hashes);
for(int i = 0; i < num_hashes; i++) {
a_coeffs[i] = rand() % 500 + 1;
b_coeffs[i] = rand() % 500 + 1;
}
// 3. Generate Signatures
vector<uint32_t> sig1(num_hashes, numeric_limits<uint32_t>::max());
vector<uint32_t> sig2(num_hashes, numeric_limits<uint32_t>::max());
for (int i = 0; i < num_hashes; i++) {
// Find minimum hash for doc1
for (uint32_t shingle : doc1) {
sig1[i] = min(sig1[i], hash_func(shingle, a_coeffs[i], b_coeffs[i], large_prime));
}
// Find minimum hash for doc2
for (uint32_t shingle : doc2) {
sig2[i] = min(sig2[i], hash_func(shingle, a_coeffs[i], b_coeffs[i], large_prime));
}
}
// 4. Estimate Similarity
int matches = 0;
for (int i = 0; i < num_hashes; i++) {
if (sig1[i] == sig2[i]) matches++;
}
double estimated_jaccard = (double)matches / num_hashes;
cout << "Estimated Jaccard Similarity: " << estimated_jaccard << endl;
return 0;
}
No comments:
Post a Comment
Note: Only a member of this blog may post a comment.